PREFACE


This  report  describes  the  deliberations  of  a workshop  held  at  The 
Rand  Corporation  on  March  9-10,  1977  to  study  the  feasibility  of  a 
special-purpose  computer  designed  to  solve  the  Navier-Stokes  equations — 
the  basic  expressions  describing  fluid  flow.  It  includes  the  results 
of  a preliminary  Rand  study  of  such  a computer,  used  as  a "straw  man" 
by  the  workshop  participants. 

Interest  in  fluid  mechanics,  coupled  with  early  and  continuing 
research  in  the  computer  sciences,  led  Rand  to  sponsor  this  study  using 
corporate  research  funds.  The  report  should  interest  researchers  in 
the  fields  of  computer  science,  numerical  analysis,  and  fluid  mechanics, 
particularly  persons  concerned  with  the  application  of  high-speed  com- 
putational techniques  to  the  solution  of  the  governing  equations  of 
fluid  flow. 


SUMMARY 


This  report  summarizes  the  results  of  a two-day  workshop  at  The 
Rand  Corporation,  Santa  Monica,  where  participants  discussed  the  feasi- 
bility of  developing  a special-purpose  computer  to  solve  the  Navier- 
Stokes  equations  for  a fairly  general  class  of  fluid  mechanics  problems. 
It  includes  the  text  of  a preliminary  Rand  study  of  such  a computer, 
which  the  workshop  participants  used  as  a straw  man. 

PRELIMINARY  RAND  STUDY 

In  the  fall  of  1976  and  winter  of  1977,  a preliminary  concept  for 
a Navier-Stokes  computer  was  developed  at  Rand.  This  concept  consists 
of  an  array  of  10,000  identical  large-scale  integrated  circuits  arranged 
in a 100 x 100 matrix. To reduce wiring congestion, each processor is
limited  to  communication  with  its  nearest  neighbors.  Each  processor 
includes  some  space  for  storing  algorithms  and  carrying  out  simple  cal- 
culations. Three-dimensional  problems  are  solved  by  having  each  pro- 
cessor represent  a point  in  a two-dimensional  plane.  The  storage  on  the 
processor  is  used  to  represent  data  in  the  third  dimension. 

The  potential  benefits  of  this  design  are  enormous.  Each  process- 
ing element  can  carry  out  a 64-bit  fixed-point  multiplication  in  five 
microseconds.  Though  this  is  quite  slow  compared  with  the  speed  of 
today's  best  computers,  an  array  of  10,000  such  elements  performs  simul- 
taneously 10,000  multiplications  in  that  time,  or  approximately  two 
billion  multiplications  per  second.  Given  that  integrated  circuits  of 
the  type  described  may  be  produced  in  the  near  future  for  about  $100 
per  chip  or  less,  this  architecture  concept  can  lead  to  a very  large 
performance/cost  ratio. 

An entire large problem (approximately 10^6 points) can fit on this
computer  at  one  time.  Thus,  there  would  be  no  need  to  break  the  solu- 
tion up  into  sequential  pieces  or  to  reload  the  array  frequently.  The 
resulting  saving  in  communications  time  and  other  overhead  contributes 
substantially  to  the  power  of  such  a machine. 


To exploit the full power of the array computer, the numerical
algorithms must match the machine architecture. The preliminary Rand
study included an analysis of a finite-difference approach for solving
the Navier-Stokes equations that was found to be particularly well
suited to the array processor's architecture. This approach makes use
of finite-difference approximations in the xy plane and for the time
variable but applies transform methods to the z direction. The results
of a model problem* showed that the problem could be solved at a rate
of about 0.35 sec per time step. The same problem run on a fast
conventional computer (e.g., a CDC 7600) takes about 82 sec per time step.
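
The two timings quoted above imply a speedup of a little over two hundred. A minimal sketch of that arithmetic follows; the function names and the 10,000-step run length are illustrative assumptions, not figures from the Rand study.

```python
# Timings quoted in the text for the model problem (seconds per time step).
ARRAY_SEC_PER_STEP = 0.35   # proposed array computer
SERIAL_SEC_PER_STEP = 82.0  # fast conventional computer (e.g., a CDC 7600)


def speedup(serial_sec: float, array_sec: float) -> float:
    """Return how many times faster the array machine is per time step."""
    return serial_sec / array_sec


def run_hours(sec_per_step: float, n_steps: int) -> float:
    """Return wall-clock hours for a run of n_steps time steps."""
    return sec_per_step * n_steps / 3600.0


if __name__ == "__main__":
    n_steps = 10_000  # hypothetical run length, chosen only for illustration
    ratio = speedup(SERIAL_SEC_PER_STEP, ARRAY_SEC_PER_STEP)
    print(f"speedup per step: {ratio:.0f}x")
    print(f"serial run: {run_hours(SERIAL_SEC_PER_STEP, n_steps):.1f} h")
    print(f"array run:  {run_hours(ARRAY_SEC_PER_STEP, n_steps):.2f} h")
```

Under these assumptions a run that would occupy a conventional machine for more than a week of continuous computing takes about an hour on the array.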

The  conclusions  of  Rand's  initial  study  suggested  that  a special-purpose, 
parallel-processor  machine  capable  of  important  simulations  might  be 
technically and economically feasible in the early 1980s.

WORKSHOP 

To  explore  the  concept  further,  Rand  held  a workshop  on  March  9-10, 
1977.  Researchers  from  the  computer-technology,  numerical-analysis,  and 
fluid-dynamics  communities  discussed  the  practicality  and  timeliness  of 
this  idea. 

The  workshop  participants  generally  agreed  that  appropriate  tech- 
nology now  exists  to  consider  a machine  of  the  nature  proposed  capable 
of  attacking  problems  of  theoretical  and  practical  importance.  It  would 
have a computing capacity of at least an order of magnitude superior to
the  largest  current  computers  and  could  be  built  within  the  next  3 to  5 
years  at  a cost  substantially  less  than  current  computers. 

Both  adequate  algorithms  and  sufficient  numerical  experience  are 
available  for  the  solution  of  the  relevant  partial  differential  equa- 
tions. The  family  of  designs  considered  in  the  workshop  offers  the  po- 
tential for  efficient  direct  Navier-Stokes  simulations  of  nonlinear 
laminar instabilities, boundary-layer transition and perhaps turbulent
spots, a limited simulation of turbulence at low to moderate Reynolds
numbers, and a simulation of large-scale turbulent flows at high
Reynolds numbers using sub-grid modeling.

* The problem consisted of the 3-D solution to the Navier-Stokes
equations for incompressible flow of a constant-density fluid in a
boundary layer adjacent to a rigid, impermeable, no-slip wall. The
problem was modeled on an array of 10,000 cells with 128 grid points
in the cross-stream direction.

RECOMMENDATIONS/PROPOSED ACTION

Both  the  Rand  study  and  the  workshop  consensus  suggest  that  it  may 
be  feasible  to  construct  a Navier-Stokes  computer  by  the  early  1980s. 

A two-stage  research  program  should  follow.  In  the  first  stage,  both 
the  mathematical  problems  and  the  machine-design  problems  should  be 
explored  in  an  integrated  manner  in  order  to  confirm  the  hypothesis 
that  such  a machine  is  feasible.  This  initial  stage  should  include  a 
coordinated  program  of  research  on  algorithms,  microprocessor  design, 
machine architecture, and computer software. The second stage should
consist of prototype hardware development, including the design and
construction of a small experimental array of processors to test the
design  concepts.  This  research  program  would  provide  the  basis  for  a 
decision  to  construct  a full-scale  array  processor  machine  to  solve 
the  Navier-Stokes  equations. 


I. INTRODUCTION

The  fluid-dynamics  community  has  for  some  time  been  fascinated  by 
the  possibility  of  obtaining  direct,  numerical  solutions  to  the  Navier- 
Stokes  equations  for  turbulent  flows  and  flows  undergoing  transition 
to turbulence. Fluid-flow calculations were among the earliest
applications of digital computers. For viscous-flow calculations, either
the  Reynolds  number  was  sufficiently  low  for  the  flow  to  be  truly 
laminar,  or  turbulence  was  introduced  in  a phenomenological  way  (e.g., 
the  use  of  a "turbulent  viscosity")  so  that  the  mean  flow  was  smooth. 
These  phenomenological  models  are  still  employed  and  will  certainly 
be  used  in  the  foreseeable  future  for  many  flow  calculations  of  major 
interest. However, the models are empirical and success is not assured
when  variations  in  flow  parameters  are  encountered.  Thus,  there  is  a 
continuing  interest  in  developing  the  capability  for  direct  solution 
of  the  Navier-Stokes  equations. 

During  the  fall  of  1976  and  winter  of  1977,  The  Rand  Corporation 
conducted  a corporate-funded  study  in  collaboration  with  members  of 
the  California  Institute  of  Technology  and  Old  Dominion  University 
staffs  to  investigate  the  feasibility  of  developing  a special-purpose 
computer  that  would  simulate  viscous  flow  around  realistic  geometries, 
including  some  aspects  of  the  development  of  turbulence.  The  Rand 
study  focused  on  special-purpose  parallel-array  computers  because  we 
anticipate  that  even  in  the  early  1980s  the  fastest  general-purpose 
computer  will  be  insufficient  to  cope  with  such  simulations. 

Increasing  the  amount  of  parallelism  in  computer  solutions  has 
long  been  recognized  as  the  best  way  to  achieve  substantial  performance 
improvement.  Rand  has  evolved  an  approach  which  accomplishes  this  for 
a class  of  problems.  The  Rand  concept  is  to  design  the  computer  archi- 
tecture around  the  mathematical  structure  of  the  Navier-Stokes  equations. 
Applications  to  other  specific  problem  areas  and  differential  equations 
may  be  feasible.  To  ensure  maximum  performance  improvement,  the  com- 
puter should  be  designed  by  a multidisciplinary  team  of  experts  in  com- 
puter architecture,  computational  fluid  mechanics,  numerical  analysis, 


algorithms,  and  software  systems.  This  will  ensure  that  the  machine 
architecture  is  tailored  correctly  to  the  simulation  requirements. 

The  computer  as  conceived  by  Rand  consists  of  an  array  of  10,000 
identical  large-scale  integrated  (LSI)  circuit  processors  arranged  in 
a 100 x 100 matrix (see Appendix A(I) by Dr. Ivan Sutherland). To reduce
wiring  congestion,  each  processor  is  limited  to  communication  with  its 
nearest  neighbors.  Each  processor  can  carry  out  simple  calculations 
(add  and  multiply).  Three-dimensional  problems  are  solved  by  having 
each  processor  represent  a point  in  a two-dimensional  plane  and  the 
storage  on  the  processor  is  used  to  represent  data  in  the  third  dimen- 
sion. Computation,  data  transfer,  and  communication  are  carried  out 
in  parallel.  The  result  is  that  both  the  total  number  of  addition/ 
multiplications  and  words  transferred  per  second  can  be  large  at  rela- 
tively low  overall  cost. 

The  reason  for  developing  such  a computer  is  to  make  it  possible 
to perform complex fluid-flow calculations routinely. The calculations
can  then  be  used  as  standard  tools  by  researchers  in  fluid  dynamics 
and  designers  of  military  and  civilian  systems.  This  capability  con- 
trasts with  the  potential  of  contemporary  computers  for  performing 
these  calculations  once  every  few  months  by  calculating  for  hundreds 
of  hours  per  problem. 

A simple example may illustrate this point. In 1952, L. H. Thomas(1)
used about 300 hours of time on the IBM Selective Sequence Electronic
Calculator to compute 18 points around the neutral stability curve for
plane Poiseuille flow, about 17 hours per point. This calculation was
repeated some years later,(2) as a check of a different numerical
method. It took about 20 minutes on an IBM 7094 (1.1 minutes per
point). In addition to the least stable mode, 29 other modes were also
calculated at each point. Quite recently this calculation was
repeated,(3) again as a test problem, on a medium-sized computer. It
required only 2 minutes to compute the first 30 modes for all 18 points
(6.7 sec per point). The great speedup in this type of calculation has
made it possible to use linear-stability theory as a routine design
tool.(4) The speedup is not due to more efficient algorithms but is
largely due to the increase in computer size and speed.

Historically, the performance of serial computers has improved
dramatically. However, this improvement has not been sufficient to
allow the solution of the Navier-Stokes equations directly. Alternative
concepts should be investigated, and the Rand study demonstrates
that parallel processing is a viable candidate architecture. The
central theme of this report is the application of this architecture
to the numerical simulation of fluid-flow problems.

A.  BACKGROUND 

Some time ago, Corrsin(5) estimated the range of scales of motion
in a true numerical simulation of a turbulent flow. A true simulation
is one in which all significant scales of motion are accurately
represented. Corrsin concluded that the requirements were far beyond the
capability of any computer existing at that time. Emmons(6) reexamined
this  question  in  1970  and  concluded  that,  in  the  not-too-distant  future, 
low-Reynolds-number  turbulent  flows  might  be  within  the  range  of 
investigation. 

Recently, Case et al.(7,8) reconsidered the question: Is the direct
numerical  simulation  of  turbulence  possible  with  the  most  modern  com- 
puters? They  found  that  a few  simulations  had  already  been  carried  out 
at  moderate  Reynolds  numbers,  but  concluded  that,  even  with  the  most  mod- 
ern computers  likely  to  be  available  in  the  next  few  years,  only  a limited 
range  of  Reynolds  numbers  can  be  treated.  They  reached  these  conclusions: 
The  construction  of  a high-performance  conventional  computer  to  do  high- 
Reynolds-number  simulations  appears  impractical  for  the  next  10  to  15 
years;  this  problem  could  be  attacked  by  a very  large  (on  the  order  of 
10  cells)  cellular-array  computer,  which  might  be  feasible  in  the  mid- 
1980s;  and  a smaller-array  computer  useful  for  many  problems  appeared  to 
be  feasible  in  the  near  future. 

These  conclusions  led  them  to  suggest  that  a possible  solution  would 
be  to  build  a Navier-Stokes  computer.  They  envisioned  a special-purpose 
computer  designed  to  solve  the  Navier-Stokes  equations,  and  only  the 
Navier-Stokes  equations,  with  maximum  efficiency.  By  using  turbulence 
models,  such  a computer  could  further  be  used  for  the  numerical  simula- 
tion of  those  fluid-dynamic  problems,  which,  even  on  the  Navier-Stokes 
computer,  are  beyond  the  range  of  direct  simulation. 


A capability to perform direct simulations and complex "model"
computations  would  be  a powerful  resource  for  researchers  and  designers 
of  military  and  civil  fluid-mechanics  systems.  Computer  simulations 
of  a fluid-mechanics  problem  offer  several  advantages  over  physical 
experiments.  These  include:  flexibility  for  independently  varying 
flow  parameters,  the  capacity  to  simulate  complex  geometries  not  pos- 
sible during  an  experiment,  control  of  boundary  conditions,  and  quanti- 
tative estimates  of  flow  characteristics  impossible  to  measure.  These 
advantages  are  quite  important  to  a designer  of  systems  who  needs  to 
predict  aerodynamic  or  hydrodynamic  forces  accurately. 

Many  practical  problems  require  the  accurate  and  rapid  calculation 
of  fluid-dynamic  forces  and  flow  phenomena,  including  aircraft  and  mis- 
sile design,  meteorology  and  global  weather  forecasting,  environmental 
simulation  modeling,  and  submarine  and  ship  design.  Progress  in  the 
quantitative  analysis  of  these  problems  is  limited  to  a great  extent  by 
the  computing  power  available  for  the  simulation  of  flow  phenomena. 
Advances  in  these  areas  would  benefit  greatly  from  the  development  of 
a "Navier-Stokes"  computer. 

During  the  past  few  years,  large  advances  in  computer  technology 
have  taken  place.  Although  we  are  now  entering  the  era  of  the  super- 
computer that  will  process  hundreds  of  millions  of  instructions  per 
second,  researchers  are  interested  in  performing  more  accurate  simula- 
tions of  physical  phenomena,  creating  a demand  for  still  larger  com- 
puters. In  spite  of  the  present  impressive  performance  of  serial  pro- 
cessing general-purpose  computers,  an  upper  limit  to  their  capabilities 
seems  close.  Even  using  large-scale  integrated  micro-circuits,  improved 
heat-extraction  techniques,  innovative  hardware  design,  and  creative 
numerical  techniques,  improvements  only  on  the  order  of  factors  of  2 to 
5 are  anticipated  in  the  early  1980s  with  conventional  architecture. 

This  performance  is  inadequate  to  simulate  numerically  many  fluid-dynamic 
problems  without  questionable  phenomenological  modeling. 

Communication  limits  performance  improvement  because  instructions, 
data,  and  results  are  transmitted  at  a finite  rate  over  a finite  dis- 
tance. Sequential  handling  of  the  massive  amount  of  data  generated  in 
large  computers  is  thus  rate-limited  and  this  is  a definite  bottleneck 
in increasing computational speed. Attempts to overcome this problem
have  produced  more  complex  hardware  designs. 

To  penetrate  this  performance  barrier,  researchers  have  proposed 
the  concept  of  a computer  consisting  of  a parallel  array  of  thousands 
of  fast  microprocessors.  Data  are  divided  and  distributed  among  a 
number  of  microprocessors  that  perform  identical  calculations.  The 
architecture  of  such  a computer  is  strongly  dictated  by  the  character- 
istics of  the  problem  to  be  solved;  so  one  must  incorporate  the  nature 
of  the  simulation  problem  in  the  design  of  the  hardware,  the  numerical 
analysis,  and  the  development  of  software.  A numerical  simulation  that 
requires  many  different  types  of  operation  on  a single  set  of  data  will 
not  exploit  the  advantages  of  this  parallel  arrangement,  and  many  pro- 
cessors will  stand  idle  while  a few  are  working. 

B. PARALLEL-PROCESSOR COMPUTERS

The concept of an array computer is quite old. (See, for example,
the discussion and description of the Von Neumann array computer in
Thurber.(9)) Far fewer array computers than sequential computers have
been  built  in  the  past  because  of  system  complexity  and  high  cost. 
Individual  processors  were  expensive,  and  reproducing  large  numbers  of 
components increased both costs and control overhead.

ILLIAC IV is the best known general-purpose array processor.(9)
It can be viewed as an 8 x 8 array of cells (processing elements, PE's)
arranged  in  a grid  communicating  with  nearest  neighbors  and  end  around. 
Each  PE  is  a rather  sophisticated  general-purpose  computer  and  has  2K 
64-bit  words  of  memory. 

In  a way,  ILLIAC  IV  is  both  too  big  and  too  small  for  the  Navier- 
Stokes  equations.  It  is  a general-purpose  computer  with  more  features 
than  needed  for  the  purposes  here.  It  is  also  too  big  because  it  was  de- 
signed in  anticipation  of  major  advances  in  a number  of  areas  of  com- 
puter technology.  Some  of  these  advances  did  not  occur  on  schedule, 
so  its  performance  and  reliability  did  not  reach  original  design  goals. 

At the same time, ILLIAC IV is too small because an 8 x 8 array is
inadequate for most problems in computational fluid dynamics. If the ILLIAC
memory is used in the 32-bit mode, the 4K words of memory per PE can hold
the data for about 4 x 10^2 grid points assuming 10 variables per grid
point. Therefore ILLIAC can hold about 2.5 x 10^4 grid points in PE
memory, far too few for most problems in computational fluid dynamics.
Increasing  the  number  of  grid  points  would  require  frequent  loading  and 
unloading  of  the  PE  memories.  A realistic  viscous  fluid-dynamics 
problem  will  require  at  least  10^  to  10^  grid  points  to  resolve  the 
important  and  experimentally  observed  features  of  a high-Reynolds- 
number  flow.  The  realization  of  a computer  which  can  accommodate  such 
a large  number  of  grid  points  is  suggested  by  the  idea  of  a special- 
purpose,  fluid-dynamics  computer  with  tens  of  thousands  of  parallel 
microprocessors.
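
The ILLIAC IV capacity estimate above is simple enough to check directly. The sketch below assumes that "4K words" means 4096 words. The problem size used to count memory reloads (100 x 100 x 128 points, the size of the model problem described elsewhere in this report) is chosen only for illustration, and all function names are hypothetical.

```python
import math

ILLIAC_PES = 8 * 8          # 8 x 8 array of processing elements
WORDS_PER_PE_32BIT = 4096   # "4K words" per PE in 32-bit mode, taken as 4096
VARIABLES_PER_POINT = 10    # variables stored per grid point (from the text)


def points_per_pe(words: int, variables: int) -> int:
    """Grid points whose data fit in one PE memory."""
    return words // variables


def total_points(n_pes: int, words: int, variables: int) -> int:
    """Grid points that fit in the PE memories of the whole array."""
    return n_pes * points_per_pe(words, variables)


def memory_loads(problem_points: int, capacity: int) -> int:
    """Number of separate memory loads needed to pass once over a problem."""
    return math.ceil(problem_points / capacity)


if __name__ == "__main__":
    capacity = total_points(ILLIAC_PES, WORDS_PER_PE_32BIT, VARIABLES_PER_POINT)
    problem = 100 * 100 * 128  # illustrative problem size, an assumption
    print(f"points per PE: {points_per_pe(WORDS_PER_PE_32BIT, VARIABLES_PER_POINT)}")
    print(f"points in array: {capacity}")
    print(f"loads per pass: {memory_loads(problem, capacity)}")
```

With these numbers each pass over the illustrative problem would need dozens of load and unload cycles, which is the overhead the text refers to.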

Recent  advances  in  large-scale  integrated  circuit  technology  and 
reduction  in  manufacturing  costs  have  removed  the  economic  barrier  to 
the  use  of  large  numbers  of  microprocessors  in  the  array  computer. 

Once  the  circuit  design  and  set-up  costs  are  incurred,  chip  reproduc- 
tion costs  are  small.  By  using  current  technology  in  the  chip  design, 
as  developed  for  widespread  use  throughout  the  electronics  industry, 
the  development  costs  and  technical  risks  associated  with  the  construc- 
tion of  the  array  machine  are  reduced  without  compromising  the  effec- 
tiveness of  the  design. 

The  initial  conclusions  of  the  Rand  study  suggested  that  an  array 
computer  capable  of  important  simulations  of  the  Navier-Stokes  equations 
might  be  technically  and  economically  feasible  by  the  early  1980s.  To 
further  explore  the  practicality  and  timeliness  of  this  concept,  Rand 
invited researchers from the computer-technology, numerical-analysis,
and  fluid-dynamics  communities  to  attend  a workshop  on  March  9-10,  1977. 
This  report  summarizes  the  deliberations  of  the  workshop  along  with  the 
results  of  the  initial  Rand  study,  which  was  used  as  a straw  man  by  the 
workshop  participants.  It  also  presents  the  conclusions  derived  from 
these  discussions.  Section  II  describes  the  workshop's  organization 
and the results of the preliminary Rand study. Section III summarizes
the  deliberations  of  the  working  groups.  Conclusions  and  recommendations 
are  given  in  Section  IV.  Detailed  presentations  of  Rand's  initial  design 
study  and  numerical  analyses,  a list  of  workshop  participants,  and  the 
agenda  are  included  in  the  appendixes. 
II. WORKSHOP ORGANIZATION AND DISCUSSION OF RESULTS
OF  PRELIMINARY  RAND  STUDY 


The  workshop  provided  a forum  for  researchers  involved  in  the 
various aspects of the problem: hardware architecture, software
systems, and applications. It included informal presentations on each
aspect  to  spark  discussion  and  an  interchange  of  ideas.  The  objectives 
of  the  workshop  were  to  determine  the  status  of  computer  technology  and 
numerical  techniques  necessary  for  the  development  of  a special-purpose 
computer  designed  to  solve  the  Navier-Stokes  equations;  to  discuss  its 
feasibility  in  the  early  1980s;  and  to  generate  from  a group  proficient 
in  both  numerical  and  experimental  fluid  mechanics  suggestions  for 
appropriate  applications  to  key  fluid-dynamics  problems. 

The  first  day  of  the  two-day  workshop  was  devoted  to  the  informal 
presentations  and  the  second  to  working  group  discussions.  The  subjects 
covered  during  the  first  day  were  computer  technology,  numerical  simula- 
tion, and  fluid-mechanics  applications.  The  purpose  of  these  presenta- 
tions was  to  provide  a focus  for  working  group  discussions  and  to  provide 
an  interdisciplinary  interchange  of  ideas  and  information.  An  underlying 
theme  was  to  isolate  the  pacing  technologies  and  issues  and  to  arrive  at 
a preferred  development  program  if  the  state-of-the-art  could  be  shown 
to  warrant  such  actions. 

During  the  first  day,  Rand  consultants  and  members  of  the  research 
staff  presented  the  results  of  their  initial  study.  This  included  a 
preliminary concept of the basic architecture of the machine (see
Appendix A(I) by I. Sutherland), a study of the numerical techniques that could
efficiently exploit the parallel arrangement (see Appendix C by C. E.
Grosch), and suggestions for possible fluid mechanics applications (see
Section  III(C)  by  C.  Gazley,  Jr.). 

The straw-man computer concept developed by Dr. Sutherland is an
array consisting of 10^4 identical large-scale integrated circuits that
are arranged in a 100 x 100 matrix. Each processor is capable of
storing about 1.6 x 10^4 bits of information arranged as 256 words of 64
bits or 512 words of 32 bits. To reduce wiring congestion, each
processor (cell) is limited to communication with its nearest
neighbors. This feature minimizes the cost of interconnections, which could
be  a major  expense.  In  addition  to  this  simple  local  communication, 
there  are  communication  lines  from  a central  source  to  the  10,000  pro- 
cessors. To  enhance  the  computing  power  of  each  processor,  it  is 
desirable  for  each  to  have  a maximum  of  storage  and  some  space  for 
storing  algorithms.  Three-dimensional  problems  are  solved  by  having 
each  processor  represent  a point  in  a two-dimensional  plane,  and 
storage  on  each  processor  is  used  to  represent  data  in  the  third  dimen- 
sion. This  points  out  the  necessity  of  having  some  memory  associated 
with  each  processor.  After  reviewing  Dr.  Sutherland's  concept  during 
the two-day workshop, Dr. Gaines developed a variant on this concept,
and  this  is  included  in  Appendix  A. 
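
The mapping described above, with one processor per point of the plane and the third dimension held in local storage, can be sketched as follows. This is an illustration under stated assumptions, not the Sutherland or Gaines design: the array edges are assumed not to wrap around, the variable-major memory layout is hypothetical, and the choice of 128 points in the third dimension is borrowed from the model problem.

```python
from typing import List, Tuple

ARRAY_SIDE = 100       # 100 x 100 processors (from the text)
WORDS_PER_CELL = 512   # 32-bit words of storage per processor (from the text)


def neighbors(i: int, j: int, side: int = ARRAY_SIDE) -> List[Tuple[int, int]]:
    """Nearest neighbors of processor (i, j); edges are assumed not to wrap."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < side and 0 <= b < side]


def words_per_grid_point(nz: int, words: int = WORDS_PER_CELL) -> int:
    """Words available per grid point when nz points share one processor."""
    return words // nz


def local_address(k: int, var: int, nz: int) -> int:
    """Hypothetical layout: each variable is stored contiguously over k."""
    if not 0 <= k < nz:
        raise ValueError("k must lie in [0, nz)")
    return var * nz + k


if __name__ == "__main__":
    nz = 128  # points in the third dimension, as in the model problem
    print("neighbors of corner cell:", neighbors(0, 0))
    print("words per grid point:", words_per_grid_point(nz))
    print("address of variable 2 at k=5:", local_address(5, 2, nz))
```

With 512 words and 128 points per processor, only four 32-bit words are available per grid point, which shows how tightly the per-processor memory constrains the choice of algorithm.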

The potential benefits of the parallel architecture are enormous.
We estimate that each processing element can carry out a 64-bit fixed-point
multiplication in 5 microseconds (double-precision multiplication,
each word 32 bits long). Though this is quite slow compared with the
speed of today's best computers, an array of 10,000 such elements
performs 10,000 multiplications simultaneously in that time, or
approximately 2 billion multiplications per second. We also estimate that
each processor can transfer one word to a neighboring processor in
about 3 microseconds. Again, this is quite slow by today's standards;
but, with 10^4 processors the data rate is about 3 x 10^9 words/sec, or
about 10^11 bits/sec. This is quite high when compared to the
memory-to-CPU data rate of even the most powerful conventional computers.
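
The aggregate rates quoted above follow from multiplying the per-processor rate by the number of processors. A minimal sketch of the calculation, assuming 32-bit words for the transfer rate (the text does not state the transferred word length) and using illustrative names:

```python
N_PROCESSORS = 10_000   # 100 x 100 array (from the text)
MULTIPLY_SEC = 5e-6     # one 64-bit fixed-point multiplication (from the text)
TRANSFER_SEC = 3e-6     # one word to a neighboring processor (from the text)
BITS_PER_WORD = 32      # assumption: transfers are of 32-bit words


def aggregate_rate(n_units: int, sec_per_op: float) -> float:
    """Operations per second when n_units work simultaneously."""
    return n_units / sec_per_op


if __name__ == "__main__":
    mults = aggregate_rate(N_PROCESSORS, MULTIPLY_SEC)
    words = aggregate_rate(N_PROCESSORS, TRANSFER_SEC)
    print(f"multiplications/sec: {mults:.2e}")
    print(f"words/sec:           {words:.2e}")
    print(f"bits/sec:            {words * BITS_PER_WORD:.2e}")
```

The design choice this illustrates is that slow but numerous processors win on aggregate throughput, provided the algorithm keeps all of them busy.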

These  large  computation  and  communication  rates  can  lead  to  a very 
high  performance/cost  ratio,  given  that  integrated  circuits  of  the  type 
described  are  produced  in  the  near  future  for  $100  per  chip  or  less.  An 
important  characteristic  of  the  array  of  processors  just  described  is 
that,  given  some  reasonable  amount  of  memory  per  chip,  an  entire  large 
problem  can  fit  on  this  computer  at  one  time.  Thus  there  would  be  no 
need  either  to  break  the  solution  into  sequential  pieces  or  to  reload 
the  array  frequently.  The  resulting  saving  in  communications  time  and 
other  overhead  is  expected  to  contribute  substantially  to  the  power  of 
such  a machine. 

To exploit the full power of the array computer, the numerical
algorithms must reflect the machine architecture. To be effective,
the problem has to fit the machine. Research in parallel numerical
algorithms has been pursued intensively.(9-14) However, it is not
clear that this body of research has direct application to the array
computer consisting of tens of thousands of microprocessors with
nearest neighbor communications. At the workshop, Dr. Grosch presented
his analysis of the results from a test case designed to utilize the
array architecture proposed by Dr. Sutherland (see Appendix C).

In  this  analysis  the  three-dimensional  Navier-Stokes  equations 
in  primitive  variables  and  the  Poisson  equation  for  pressure  had  been 
Fourier-transformed in the span-wise direction. For the remaining two
space  dimensions  and  time  variable,  the  equations  were  approximated  by 
using  second-order-accurate  finite-difference  schemes.  The  Adams- 
Bashforth  method  was  used  to  solve  the  velocity  equations,  and  several 
point-wise  relaxation  schemes  were  considered  for  solving  the  Poisson 
equation.  The  results  of  the  analysis  of  a model  problem  consisting 
of  a 100  x 100  array  of  cells  and  128  Fourier  components  showed  that 
the  problem  could  be  solved  at  a rate  of  about  0.35  sec  per  time  step. 
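
The two numerical building blocks named above can be sketched briefly. This is not Dr. Grosch's code; it assumes the standard second-order Adams-Bashforth formula and a five-point Jacobi relaxation. After a Fourier transform in the span-wise direction, the Poisson equation for one wavenumber kz becomes d2p/dx2 + d2p/dy2 - kz^2 p = rhs, which is the form relaxed here. The function names, the uniform spacing h, and the use of real rather than complex coefficients are assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]


def adams_bashforth2(u: float, f_now: float, f_prev: float, dt: float) -> float:
    """Advance u one time step: u + dt * (3/2 * f_now - 1/2 * f_prev)."""
    return u + dt * (1.5 * f_now - 0.5 * f_prev)


def jacobi_sweep(p: Grid, rhs: Grid, h: float, kz: float) -> Grid:
    """One pointwise relaxation sweep for
    d2p/dx2 + d2p/dy2 - kz**2 * p = rhs on a uniform grid of spacing h.

    Boundary values are held fixed. Each interior point reads only its
    four nearest neighbors, matching the array's communication limit.
    """
    n, m = len(p), len(p[0])
    new = [row[:] for row in p]  # copy so every update uses old values
    denom = 4.0 + (h * kz) ** 2
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            s = p[i - 1][j] + p[i + 1][j] + p[i][j - 1] + p[i][j + 1]
            new[i][j] = (s - h * h * rhs[i][j]) / denom
    return new


if __name__ == "__main__":
    # Relax a small grid with zero boundary values and zero right-hand side.
    p = [[1.0 if 0 < i < 4 and 0 < j < 4 else 0.0 for j in range(5)]
         for i in range(5)]
    zero = [[0.0] * 5 for _ in range(5)]
    for _ in range(50):
        p = jacobi_sweep(p, zero, h=1.0, kz=0.0)
    print(f"center value after 50 sweeps: {p[2][2]:.6f}")
```

On the array machine each (i, j) update would run on its own processor, so one sweep costs roughly one neighbor exchange plus a few arithmetic operations regardless of grid size.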

Professor  Grosch  concluded  that  velocity  calculations  were  amen- 
able to  an  array  computer  but  that  the  pressure  calculations  were  more 
difficult  and  required  more  than  60  percent  of  the  computing  time. 

This  implies  that  computational  performance  improvements  are  being 
paced  by  improvements  in  the  pressure  calculations  and  that  experimen- 
tation on  ways  to  perform  these  calculations  could  lead  to  a substantial 
payoff.  The  major  conclusion  of  this  study  is  that  it  is  possible  to 
develop  a numerical  strategy  for  using  a parallel-processor  machine 
while  accepting  the  constraint  of  local  communication  between  nearest 
neighbors  only. 
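
The remark that pressure calculations pace overall performance can be made quantitative with Amdahl's law. The sketch below takes the pressure fraction as exactly 0.6, although the text says only "more than 60 percent", so the results are lower bounds on the benefit; the function names are illustrative.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole step when a part that takes
    `fraction` of the time is made `factor` times faster."""
    if not 0.0 <= fraction <= 1.0 or factor <= 0.0:
        raise ValueError("need 0 <= fraction <= 1 and factor > 0")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def limiting_speedup(fraction: float) -> float:
    """Upper bound on overall speedup as the factor grows without limit."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("need 0 <= fraction < 1")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    pressure_fraction = 0.6  # assumption: taken at the quoted lower bound
    for factor in (2.0, 5.0, 10.0):
        gain = overall_speedup(pressure_fraction, factor)
        print(f"pressure solve {factor:4.1f}x faster -> step {gain:.2f}x faster")
    print(f"limit: {limiting_speedup(pressure_fraction):.2f}x")
```

Even a modest improvement in the pressure solver therefore shortens every time step noticeably, while improvements to the velocity update alone could gain at most a factor of about 1.7 under the same assumption.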

During  the  second  day  of  the  workshop,  the  participants  were  divided 
into  three  groups  to  discuss  each  of  the  critical  areas  of  interest: 
computer  technology  and  architecture,  numerical  analysis,  and  fluid- 
mechanics  application.  The  members  of  the  working  groups  are  listed 
in  Appendix  B,  and  their  deliberations  are  summarized  in  the  following 
sections  of  this  report. 
III. SUMMARY OF WORKING GROUP DELIBERATIONS

The  workshop  participants  were  selected  to  provide  as  broad  a 
representation  as  possible  from  the  relevant  research  areas.  This  pro- 
vided an  opportunity  for  an  active  and  free  exchange  of  ideas  and  expe- 
rience. The  first  day's  series  of  briefings  on  computer  design  and 
technology,  numerical  analysis,  and  fluid-mechanics  applications  gave 
each  participant  an  appreciation  for  the  problems  outside  his  area  of 
expertise  and  helped  to  provide  a common  basis  for  the  discussions  of 
the  second  day. 

Each  group  was  charged  with  the  task  of  evaluating  the  status  of 
its  area  of  concern  as  that  area  related  to  the  design  of  a special- 
purpose  parallel-processor  machine  for  the  solution  of  the  Navier-Stokes 
equations.  If  the  status  of  the  area  was  too  immature  for  reasonable 
predictions,  the  groups  were  to  suggest  possible  developmental  or  exper- 
imental programs  needed  to  advance  the  state-of-the-art  of  their  areas 
to  a level  required  to  develop  such  a machine.  The  summaries  of  the 
working-group  deliberations  along  with  their  conclusions  are  presented 
in  this  section. 

A.  COMPUTER  TECHNOLOGY  AND  ARCHITECTURE 

The  technology  and  architecture  group  addressed  issues  ranging 
from  the  feasibility  of  such  a machine  through  several  aspects  of  over- 
all design  philosophy  and  approach.  A major  conclusion  of  this  group 
was  that  appropriate  technology  now  exists  to  enable  one  to  consider 
an  interesting  machine  of  the  nature  proposed.  Important  to  the  develop- 
ment of  such  a system  would  be  work  on  algorithms,  architecture,  and 
software.  It  was  also  recommended  that  feasibility  studies  and  modeling
be  done  before  any  commitment  is  made  to  large  amounts  of  funding
and  manpower.  These  and  other  topics  and  recommendations  considered 
important  are  covered  in  detail  in  the  following  discussion. 

1.  Algorithms  and  Architecture 

a.  First  priority  in  the  design  of  an  array  computer  should  be 
given  to  an  extensive  effort  aimed  at  algorithm  development  and  simulation.
This  effort  should  be  concurrent  with,  and  should  interact  with,
the  architectural  development.  This  is  important  since  experience  with 
developing  algorithms  for  parallel  processing  is  limited.  While  algo- 
rithms, software,  and  hardware  can  be  modeled  on  general-purpose  com- 
puters, such  simulations  must  incorporate  processing  appropriate  to  the 
methods  being  studied,  and  must  include  communications  and  timing  over- 
head. As  part  of  the  algorithm-simulation  and  cost  trade-off  analysis, 
small  hardware  prototypes  and  experiments  should  be  considered.  With 
these,  hard  data  could  be  developed  with  which  to  validate  the  simulation 
and  modeling — particularly  in  terms  of  data-and-control  paths  and  their 
timing. 

b.  Above  all,  simplicity  of  interconnections  must  dominate  archi- 
tecture at  all  levels:  within  LSI,  at  the  board  level,  and  at  the 
system  level. 

2.  Reliability 

a.  Reliability  in  both  hardware  and  software  is  of  primary  concern 
and  must  be  designed  in  at  all  levels  from  the  beginning.  A major  con- 
tributor to  failure  is  memory.  Memory  within  the  array  should  include 
single-error  correction  and  double-error  detection.  In  addition,  con- 
siderable effort  should  be  devoted  to  the  development  of  methods  for 
error  correction  and  detection  in  each  computational  element.  While  it 
was  recommended  that  gradual  degradation  be  included,  it  was  recognized 
that  it  is  not  always  feasible.  Diagnostic  hardware  and  software  includ- 
ing restart  and  rollback  procedures  should  be  given  extensive  attention 
during  system  design  and  should  be  available  early. 

b.  One  major  question  that  is  yet  to  be  answered  is:  What  is  the 
maximum  number  of  processors  that  can  be  assembled  without  exceeding  an 
acceptable  failure  rate? 
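
One way to frame the question in item b is with a simple series-reliability model. The sketch below is illustrative only: it assumes (the working group did not state this) that processors fail independently at a constant rate and that any single failure stops the computation. The function names and the sample per-element figure are hypothetical.

```python
import math

def array_mtbf_hours(element_mtbf_hours: float, n_elements: int) -> float:
    """Mean time between failures of an array in which any single
    element failure stops the run.

    Assumes independent, constant (exponential) failure rates, so the
    element failure rates simply add.
    """
    return element_mtbf_hours / n_elements

def max_elements(element_mtbf_hours: float, required_array_mtbf_hours: float) -> int:
    """Largest array size that still meets a required array MTBF."""
    return math.floor(element_mtbf_hours / required_array_mtbf_hours)

# Hypothetical figure: each element averages 1,000,000 hours between failures.
print(array_mtbf_hours(1_000_000, 10_000))   # 100.0 hours for a 100 x 100 array
print(max_elements(1_000_000, 24))           # array size allowed for a 24-hour MTBF
```

Under these assumptions the acceptable array size scales directly with element reliability, which is why error correction and rollback procedures matter more as the array grows.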

3.  Component  Technology 

At  the  time  specifications  for  a processing  system  are  frozen,  only
production  LSI  should  be  considered.  For  custom  LSI,  such  as  read-only 
memory  systems  (ROMS)  or  programmable  read-only  memory  systems  (PROMS), 
only  those  design  rules  and  processing  methods  then  currently  in  use
for  high-volume  construction  of  such  components  should  be  considered. 
Before  any  custom  LSI  components  are  designed,  subject  to  the  above 
constraints,  a thorough  search  for  available  component  alternatives 
should  be  made.  Further,  a realistic  performance/cost  analysis,  in- 
cluding design,  testing,  and  estimated  production  volume  costs,  should 
be  performed. 

4.  Software 

a.  Major  emphasis  should  be  placed  on  the  development  of  a high- 
level  language  appropriate  to  the  specific  machine  design  and 
applications.

b.  Special  instructions  that  are  important  to  performance  should 
be  available  to  the  user. 

c.  A language  simulator  should  be  available  early. 

5.  Human  Interface 

a.  In  addition  to  the  usual  input  techniques,  it  was  felt  that 
interactive  graphical  data-input  techniques  would  be  a desirable  fea- 
ture. However,  it  was  strongly  recommended  that  standard  available 
equipment  be  used  with  a minimum  of  special  software  and  special  inter- 
faces in  order  to  keep  costs  down. 

b.  Post-processing  should  include  not  only  the  processing  of 
output  data  but  also  the  ability  to  take  a snapshot  of  what  is  going
on  in  the  algorithms,  the  solutions,  or  the  hardware  during  the  computation.
Limited  post-processing  for  snapshots  must  run  in  parallel  with
the  computations.  Intermediate  results  must  be  available  in  as  near 
real  time  as  possible,  again  primarily  in  the  form  of  snapshots. 
Appropriate  software  should  be  available  early  for  obtaining  inter- 
mediate results  and  post-processing  as  an  aid  to  system  debugging  and 
application  development. 

c.  The  host  system,  or  control  unit,  should  be  a readily  available
unit.  The  interface  between  the  Navier-Stokes  computer  and  the
host  system  should  use  off-the-shelf  software  and  hardware  including
input/output  if  at  all  possible.  The  intent  of  this  recommendation
is  the  avoidance  of  unnecessary  costs  and  development  efforts. 

6.  Data  Base  Management 

There  was  insufficient  time  to  consider  in  detail  the  question  of 
data-base  management  for  a system  that  would  be  as  data-intensive  as 
the  proposed  system.  This  is  viewed  as  an  extremely  important  subject 
and  must  be  addressed  in  terms  of  the  architecture,  product  availability, 
data-and-control  communications  paths,  cost,  reliability,  and  modularity. 
It  was  suggested  that  this  might  be  more  important  than  what  goes  on 
within  the  array  itself. 

B.  NUMERICAL  SIMULATIONS 

The  principal  conclusion  of  the  numerical  simulation  working  group 
is  that  adequate  algorithms  and  sufficient  numerical  experience  exist 
for  the  solution  of  the  relevant  partial  differential  equations.  This 
suggests  that  it  is  now  appropriate  to  initiate  a feasibility  study  of 
the  design  of  a parallel-processor  machine  for  the  solution  of  the 
problems  envisioned  employing  appropriate  numerical  techniques.  It  is 
expected  that,  as  time  goes  by,  improvements  in  algorithm  development 
will  accelerate.  This  in  no  way  should  delay  the  progress  of  a feasi- 
bility study  in  which  modeling  and  experimentation  will  play  key  roles. 
The  working  group  also  made  specific  observations  and  recommendations 
as  follows. 

1.  Variables  and  Equations 

a.  For  three-dimensional  problems,  the  primitive  variables,  veloc- 
ity and  pressure,  are  believed  to  be  computationally  superior  to  vector 
vorticity  and  vector  potential.  They  demand  less  storage  (four  dependent 
variables  instead  of  six),  boundary  conditions  are  more  easily  applied, 
and  low-speed  compressibility  and  energy  transfer  are  more  easily  added 
to  the  equations. 

b.  Employing  the  continuity  and  momentum  equations  (where  the 
pressure  gradients  in  the  momentum  equations  would  be  adjusted  to  guar- 
antee that  the  continuity  equation  is  satisfied)  as  the  governing 
equations  gives  computational  advantages  over  the  use  of  the  momentum
and  pressure  (Poisson)  equations.  The  solution  of  an  elliptic  equation 
for  the  pressure  is  replaced  by  pressure  adjustments  in  the  momentum 
equation.  Even  though  the  relative  merits  of  each  method  are  not  fully 
understood,  progress  in  a feasibility  study  will  not  be  hindered. 

2.  Numerical  Methods 

a.  The  method  of  finite  differences  was  favored  for  the  numerical 
solution  of  the  equations  over  spectral  and  other  methods.  The  prin- 
cipal considerations  that  led  to  this  conclusion  were:  the  expectation 
that  problems  with  bodies  of  more  complicated  shapes  than  flat  plates, 
spheres,  etc.,  would  be  more  difficult  to  treat  by  spectral  methods; 
the  existence  of  a very  large  body  of  favorable  experience  and  knowl- 
edge of  advantages  and  pitfalls  with  finite-difference  schemes  compared 
to  the  other  methods;  the  flexibility  and  robustness  of  the  method;  and 
the  ease  of  programming.  The  major  disadvantage  is  that  some  problems, 
by  their  nature,  will  require  nonrectangular  grids. 

b.  Second-order  differences  in  space  and  time  have  advantages 
over  higher-order  differences  because  of  lower  storage  and  communica- 
tion requirements,  ease  of  application  of  boundary  conditions,  shorter 
computation  time  per  step,  and  ease  of  programming.  (It  is  possible 
that  in  using  higher-order  methods,  lower-order  approximations  for  the 
boundary  conditions  might,  in  some  problems,  propagate  the  same  order 
of  error  throughout  the  solution  thereby  destroying  whatever  advantages 
the  higher-order  methods  seem  to  have.)  Thus,  the  group  recommends 
second-order  over  higher-order  differences  for  these  problems. 
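
The meaning of "second-order" in item b can be checked numerically. The sketch below is an illustration only (the test function and step sizes are arbitrary choices, not taken from the workshop): it applies the standard central difference, which uses only the two adjacent grid values, and confirms that halving the spacing reduces the error by about a factor of four.

```python
import math

def central_diff(f, x: float, h: float) -> float:
    """Second-order central difference approximation to f'(x).

    Only the two neighboring sample points x - h and x + h are needed.
    """
    return (f(x + h) - f(x - h)) / (2.0 * h)

def error(h: float) -> float:
    # Test function sin(x); the exact derivative at x = 1 is cos(1).
    return abs(central_diff(math.sin, 1.0, h) - math.cos(1.0))

# For a second-order scheme, halving h should cut the error by about 4.
ratio = error(0.1) / error(0.05)
print(ratio)
```

Because the stencil reaches only one grid point in each direction, a scheme of this kind needs data from immediate neighbors alone, which is consistent with the lower storage and communication requirements cited above.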

c.  The  numerical  stability  advantages  of  implicit  finite-difference
schemes  over  explicit  schemes  for  the  solution  of  the  Navier-Stokes
equations  are  not  as  obvious  for  a three-dimensional  simulation  as  they  are
for  boundary-layer  flow.  Additional  programming,  extra  arithmetic  oper- 
ations, logic  complications,  and  extra  storage  demands  may  obscure  antic- 
ipated benefits  of  a large  time  step,  except  perhaps  for  computations 
that  extend  to  steady-state  flow.  Thus,  a critical  assessment  of  the 
relative  merits  of  implicit  vs.  explicit  schemes  remains  to  be  made  and 
could  be  accomplished  by  performing  appropriate  numerical  experiments. 
Such  studies  can  prove  or  disprove  the  conjecture  that  some  implicit
schemes  should  be  considered  as  iterated  explicit  schemes,  a point 
which  can  allow  the  computer  hardware  to  be  the  same  regardless  of  the 
nature  of  the  scheme. 

d.  If  the  design  of  the  computer  were  to  adhere  to  a principle
of  nearest-neighbor  communication,  it  is  believed  that  iterative  methods
for  solution  of  the  Poisson  equation  and  for  the  Navier-Stokes  equations 
(if  implicit  finite-difference  equations  are  used)  would  be  more  appro- 
priate than  direct  methods  to  both  the  computation  and  the  programming. 
There  may  be  some  advantage  to  incorporating  a direct  tridiagonal  linear 
systems  solver,  because  this  might  offer  the  possibility  of  using  line 
successive  overrelaxation  methods  or  alternating-direction  methods.
This  can  be  investigated  during  an  algorithm-simulation  study.  Recent
developments  in  accelerating  the  convergence  of  relaxation  methods  for 
the  solution  of  the  elliptic-difference  equations  or  implicit  parabolic- 
difference  equations  by  adaptively  coarsening  and  refining  the  differ- 
ence-equation grids  suggest  that  provisions  to  incorporate  such  a capa- 
bility into  the  design  be  seriously  considered. 
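
To illustrate why iterative methods suit nearest-neighbor communication, the following minimal sketch uses point Jacobi relaxation on the two-dimensional Poisson equation. Jacobi is chosen here only because it is the simplest such method; the working group did not name a specific scheme, and the grid size, boundary data, and function names are assumptions made for the example.

```python
def jacobi_sweep(p, rhs, h):
    """One Jacobi sweep for the 2-D Poisson equation  lap(p) = rhs.

    Each interior value is updated from its four nearest neighbors only,
    which is the communication pattern a nearest-neighbor array supports.
    Boundary values of p are held fixed (Dirichlet conditions).
    """
    n, m = len(p), len(p[0])
    new = [row[:] for row in p]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (p[i - 1][j] + p[i + 1][j] +
                                p[i][j - 1] + p[i][j + 1] - h * h * rhs[i][j])
    return new

def solve(p, rhs, h, sweeps):
    """Repeat Jacobi sweeps a fixed number of times."""
    for _ in range(sweeps):
        p = jacobi_sweep(p, rhs, h)
    return p

# Small check: Laplace's equation (rhs = 0) with p = x on the boundary.
# The exact solution is p = x everywhere, so the interior should converge to it.
n = 9
h = 1.0 / (n - 1)
p0 = [[(j * h if i in (0, n - 1) or j in (0, n - 1) else 0.0)
       for j in range(n)] for i in range(n)]
zero = [[0.0] * n for _ in range(n)]
p = solve(p0, zero, h, sweeps=500)
print(abs(p[4][4] - 0.5))   # remaining error at the center of the grid
```

Every update reads only the four adjacent values, so on an array machine each sweep costs one exchange with each neighbor. The slow convergence of such simple relaxation on fine grids is what makes the grid-coarsening acceleration mentioned above attractive.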

e.  Some  recent  experiments  comparing  finite-element  methods  with 
finite-difference  methods  on  advective/diffusive  problems  suggest  that
the  issue  of  the  selection  of  finite-difference  methods  over  finite- 
element  methods  may  not  be  closed.  The  finite-element  methods  have 
considerable  leeway  in  the  irregularity  of  the  grids  and  incorporate 
boundary  conditions  as  well.  However,  the  finite-element  experiments 
indicated  above  were  not  in  three  space  dimensions  and  were  applied  to 
linear  problems.  What  is  clear  is  that  there  is  limited  experience  in 
utilizing  finite-element  methods  in  fluid-mechanics  problems. 

C.  FLUID-MECHANICS  APPLICATION  AREAS  FOR  A SPECIAL-PURPOSE  COMPUTER 

In  considering  the  spectrum  of  possibilities  for  a special-purpose 
computer  for  application  to  problems  in  fluid  mechanics,  there  was  a 
general  consensus  in  the  Applications  Group  that  the  first  such  machine 
should  be  at  the  smaller  end  of  the  spectrum.  It  should  be  relatively 
inexpensive  and  relatively  simple.  In  considering  the  class  of  prob- 
lems which  are  "do-able"  and  which  are  of  sufficient  general  "practical" 
interest  to  permit  development  of  such  a machine,  the  group  agreed  that
the  problem  of  incompressible  (but  possibly  variable-property)  boundary- 
layer  stability  and  transition  is  a good  choice. 

1.  Stability  and  Transition 

This  is  a problem  which  is  of  great  practical  importance  in  both 
aerodynamics  and  hydrodynamics,  is  difficult  and  expensive  to  study 
experimentally,  and  is  prohibitively  expensive  to  compute  on  general- 
purpose  machines.  A machine  which  would  allow  computation  of  laminar 
instabilities,  including  nonlinear  and  three-dimensional  effects,  and 
the  effects  of  external  disturbances  on  their  origin  and  growth  to 
turbulent  spots  would  be  of  great  practical  interest.  It  would  also 
be  of  considerable  theoretical  interest  since  such  computations  would 
improve  the  understanding  of  the  physics  of  laminar  instabilities  and
their  growth.

2.  Resolution  and  Memory 

Of  importance  to  the  architecture  of  such  a machine  is  the  resolu- 
tion required  for  computation  of  laminar  instabilities.  The  group  esti- 
mated that  at  least  a million  grid  points  would  be  required  for  such 
computations.  The  resolution  problem  could  be  minimized  by  developing
a grid  in  which  the  mesh  automatically  becomes  finer  in  those  parts  of 
flow  where  instabilities  begin  to  develop.  Also  of  concern  to  the 
machine  architects  (in  the  subsequent  discussion)  are  the  memory  re- 
quirements for  this  grid  size. 
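
A rough sense of the memory implied by a million grid points can be had from the arithmetic below. It is a sketch under stated assumptions: four primitive variables per point (as recommended by the numerical group), the 64-bit word and 10,000-processor array of the straw-man design, and storage of only one copy of the flow field. Work space, additional time levels, and program storage are not counted.

```python
GRID_POINTS = 1_000_000      # lower bound estimated by the applications group
VARIABLES = 4                # three velocity components plus pressure
BITS_PER_WORD = 64           # word size of the straw-man processor
PROCESSORS = 10_000          # 100 x 100 array

total_words = GRID_POINTS * VARIABLES
total_bits = total_words * BITS_PER_WORD
words_per_processor = total_words // PROCESSORS

print(total_bits)            # 256000000 bits for one copy of the flow field
print(words_per_processor)   # 400 words per processor
```

Under these assumptions each processor would need about 400 words for the field alone, more than the 256 words per element of the original straw-man design, which may be one reason storage was identified as a limiting factor.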

3.  A Prototype  Machine 

There  was  general  agreement  that  a prototype  machine  would  be 
desirable.  This  would  probably  not  be  a large  enough  machine  to  yield 
any  new  information  on  fluid  physics  but  would  be  very  useful  for  im- 
proving knowledge  of  the  interactions  between  algorithm  and  architecture — 
thus  improving  the  architectural  design  of  the  ultimate  machine. 

4.  Efficiency  and  Development 

A desire  was  expressed  that  the  machine  be  efficient  enough  to 
keep  down  the  cost  of  re-running  problems  for  which  more  detailed  output
is  desired.  It  was  recommended  also  that  an  experimenter,  familiar
with  the  corresponding  analog  machine  (i.e.,  reality),  be  associated 
with  the  development  of  the  machine. 

IV.  CONCLUSIONS  AND  RECOMMENDATIONS

The  workshop  participants  generally  agreed  that  appropriate  tech- 
nology now  exists  to  build  a large  parallel-processor  array  computer 
consisting  of  thousands  of  microprocessors  capable  of  some  direct  simu- 
lations of  the  Navier-Stokes  equations.  Such  a machine  would  have  a 
computing  capacity  at  least  an  order  of  magnitude  greater  than  that  of  current
supercomputers  and  could  be  built  within  the  next  3 to  5 years  at  a 
cost  substantially  less  than  current  general-purpose  supercomputers. 

Adequate  algorithms  and  sufficient  numerical  experience  were  found 
to  exist  for  the  solution  of  the  relevant  partial  differential  equations. 

As  an  example,  a test  case  was  examined  consisting  of  an  incompressible 
flow  of  a constant-density  fluid  in  a boundary  layer  adjacent  to  a rigid, 
impermeable,  no-slip  wall.  The  three-dimensional  Navier-Stokes  equa- 
tions were  solved.  The  results  of  a model  problem  as  applied  to  an 
array  of  10,000  cells  with  128  grid  points  in  the  cross-stream  direc- 
tion showed  that  the  problem  could  be  solved  at  a rate  of  about  0.35 
sec  per  time  step.  This  is  about  230  times  faster  than  a fast  conven- 
tional computer  (e.g.,  a CDC  7600). 
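
The figures quoted for the test case can be combined as follows. This is simple arithmetic on the reported numbers; the only assumption is that each of the 10,000 cells holds one column of 128 grid points.

```python
ARRAY_SECONDS_PER_STEP = 0.35   # reported rate for the 10,000-cell array
SPEEDUP = 230                   # reported ratio relative to a CDC 7600
CELLS = 10_000
POINTS_PER_CELL = 128           # grid points in the cross-stream direction

grid_points = CELLS * POINTS_PER_CELL
conventional_seconds_per_step = ARRAY_SECONDS_PER_STEP * SPEEDUP
steps_per_hour_array = 3600 / ARRAY_SECONDS_PER_STEP

print(grid_points)                               # 1280000
print(round(conventional_seconds_per_step, 1))   # 80.5
print(round(steps_per_hour_array))               # time steps per hour on the array
```

On these figures the test case involves about 1.28 million grid points, and the same time step would take roughly 80 seconds on the conventional machine.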

The  large  array  computer  evaluated  in  the  workshop  was  found  to 
offer  the  potential  for  efficient  direct  Navier-Stokes  simulations  of 
nonlinear  laminar  instabilities  and  boundary-layer  transition.  A 
limited  simulation  of  turbulence  at  low  to  moderate  Reynolds  numbers 
and  a simulation  of  large-scale  turbulent  flows  at  high  Reynolds  numbers
using  sub-grid  modeling  also  appear  feasible  with  such  a computer.

The  results  of  this  preliminary  work  suggest  that  it  is  now  appro- 
priate to  continue  to  explore  the  feasibility  of  designing  a parallel- 
processor  machine  for  the  solution  of  the  Navier-Stokes  equations  using 
applicable  numerical  techniques.  Important  to  the  development  of  such 
a system  is  the  early  initiation  of  a coordinated  program  of  research 
on  algorithms,  architecture,  and  software  for  this  computer.  We  sug- 
gest that  the  research  plan  be  undertaken  in  stages. 

In  the  first  stage,  both  the  mathematical  problems  and  the  machine- 
design  problems  should  be  explored  in  an  integrated  manner  in  order  to 
confirm  the  hypothesis  that  it  is  reasonable  to  build  such  a machine.
Machine-design  problems  which  need  to  be  explored  include:  the  design 
of  the  processor  itself,  a determination  of  how  actually  to  assemble 
an  array  of  10,000  computers,  an  investigation  of  the  problem  of  com- 
munications between  elements  of  the  array,  an  analysis  of  questions 
concerning  the  appropriate  type  and  balance  of  memory  and  computational 
capability  associated  with  each  processor,  and  a determination  of  appropriate
peripheral  equipment  including  means  for  input/output.  None  of
these  questions  are  simple,  and  it  would  be  a mistake  at  this  point  to 
rush  into  the  design  of  such  a machine  without  preliminary  investiga- 
tion of  them. 

Algorithm  development  must  be  concurrent  with  and  must  interact 
with  the  architectural  development.  The  challenge  in  applying  the 
parallel-processor  concept  is  to  couch  the  problem  to  be  solved  in  a 
form  which  does  not  require  each  processor  to  communicate  with  more 
than  a few  of  its  nearest  neighbors,  since  both  many  communication 
paths  and  long  communication  paths  make  the  computer  architecture  fund- 
amentally expensive.  The  extent  to  which  one  is  able  to  develop  algo- 
rithms that  map  the  geometry  of  the  problem  (including  boundary  condi- 
tions) onto  the  geometry  of  the  computer  will  determine  the  magnitude 
of  the  computational  gains  achieved  as  promised  by  the  parallel- 
processing architecture.  This  requires  a close  working  relationship 
between  the  computer  architects  and  numerical  analysts.  Algorithms,
software,  and  hardware  should  be  modeled  using  available  general-purpose 
computers.  These  simulations  must  perform  the  processing  in  a manner 
that  accurately  models  the  type  of  architecture  proposed  for  the  actual 
machine. 

This  first  stage  should  include  an  analysis  of  the  range  of  problems 
that  can  be  solved  using  the  array  architecture  being  considered.  This 
analysis  should  use  a realistic  estimate  of  the  computational  power  of 
the  proposed  machine  based  on  the  performance  characteristics  of  hard- 
ware as  it  exists  or  will  exist  one  to  two  years  in  the  future.  Extrapo- 
lation five  to  ten  years  into  the  future  must  be  avoided  if  the  analysis 
is  to  remain  credible. 

If,  after  completion  of  this  stage,  it  still  seems  reasonable  to
proceed,  the  second  stage  should  consist  of  prototype  hardware  develop- 
ment. The  first  phase  of  this  development  should  be  the  design  and 
construction  or  simulation  of  the  actual  processing  elements  to  be  used 
at  each  point  in  the  array.  Experimentation  with  these  elements,  to 
explore  their  properties  in  some  detail,  should  be  performed  before 
determination  of  the  final  design  of  the  elements  that  will  compose 
the  final  array  processor. 

Next,  we  recommend  the  construction  of  a small  experimental  array 
of  processors  (perhaps  10  x 10  or  20  x 20)  to  test  the  concept.  The 
individual  processors  should  be  the  type  proposed  for  the  large-scale 
machine  but  need  not  be  produced  to  the  final  physical  dimensions.  In 
particular,  small  packaging  and  other  problems  that  have  to  be  addressed 
only  for  the  final  machine  could  be  avoided  in  the  initial  exploration. 

This  small  experimental  array  will  be  invaluable  because  it  will 
permit  the  acquisition  of  hard  data  on  the  suitability,  cost,  and  re- 
liability of  current  technology  when  used  in  the  development  of  this 
concept.  The  results  of  simulations  and  modeling  of  algorithms  can  also 
be  tested  against  reality.  Finally,  solutions  developed  during  earlier 
phases  of  the  study  for  problems  associated  with  hardware/software 
diagnostics,  input/output,  interfacing,  data  management,  etc.,  can  be 
tested  on  real  hardware  at  low  cost. 

The  completion  of  these  two  design  stages  will  provide  the  infor- 
mation necessary  to  decide  whether  it  is  appropriate  to  enter  a detailed 
design,  development,  and  construction  phase  for  a full-scale  array  pro- 
cessor machine  to  solve  the  Navier-Stokes  equations. 

1.  A STRAW-MAN  COMPUTER  FOR  NAVIER-STOKES  SIMULATION:
A PAPER  PRESENTED  AT  THE  NAVIER-STOKES  WORKSHOP 
HELD  AT  RAND  9-10  MARCH  1977 
Ivan  E.  Sutherland 
California  Institute  of  Technology 


A STRAW-MAN  COMPUTER 

This  note  describes  a specific  computational  system  believed  to 
be  relevant  to  the  Navier-Stokes  problem.  This  straw-man  design  was 
presented  at  the  opening  of  the  conference  to  serve  as  a specific  point 
of  discussion  during  working  group  deliberations.  A straw  man  provides
us  with  a machine  of  sufficient  complexity  to  help  us  determine  whether  the
computations  that  can  be  handled  by  such  a machine  or  a scaled-up  ver- 
sion are  of  interest  from  an  applications  point  of  view. 

While  the  straw-man  design  may  not  be  correct  in  detail,  it  never- 
theless presents  the  appropriate  constraints  on  the  design  of  special- 
ized computing  machinery  that  must  be  considered  in  problem  formulation 
and  algorithm  development.  The  understanding  of  the  design  limitations 
in  such  a machine  and  the  implications  of  such  limitations  on  the  forms 
of  computations  will  form  an  important  part  of  any  further  consideration 
of  such  machines. 

The  straw-man  design  consists  of  10,000  identical  integrated  circuits,
each  mounted  in  a 16-pin  integrated  circuit  package.  These
10,000  circuits  are  arranged  in  a rectangular  array  100  x 100  occupying
a space  approximately  8 feet  square.  One  can  imagine  the  entire  com- 
puter mounted  on  a wall.  Each  integrated  circuit  in  the  "straw-man" 
design  is  capable  of  storing  about  16,000  bits  of  information  arranged 
as  256  words  of  64  bits  each.  In  addition,  each  of  the  integrated 
circuits  is  capable  of  performing  a 64-bit  fixed-point  addition  in 
about  100  nanoseconds.  Each  of  these  computational  elements  receives 
identical  instructions  from  a common  central  source. 

In  order  to  reduce  the  wiring  congestion  of  the  machine,  each  of 
these  elements  communicates  only  with  its  nearest  neighbors.  Two  alter- 
native communication  schemes  are  possible  as  shown  in  Figs.  1 and  2. 

Fig.  2b:  Wiring  schematic  for  a large-array  computer  (identical  outgoing  messages)

In  the  simpler  communication  scheme,  schematically  illustrated  in  Fig.
1a,  a one-way  reversible  communication  path  exists  between  each  adjacent
pair  of  computing  elements.  This  communication  path  is,  in  fact,
a single  wire  between  the  two  elements  as  shown  in  Fig.  1b.  In  this
scheme,  each  computing  element  may  communicate  one  bit  at  a time  to 
or  from  an  adjacent  neighbor  but  cannot  send  and  receive  simultaneously. 
This  scheme  permits  a computing  element  to  send  up  to  four  messages 
simultaneously,  each  of  which  may  be  different.  An  alternative  scheme 
shown  in  Fig.  2 has  each  element  transmitting  an  identical  message  to 
its  four  immediate  neighbors  and  is  able  simultaneously  to  receive  a 
message  from  each  of  them.  This  scheme  requires  5 pins  on  the  inte- 
grated circuit  package  but  provides  for  twice  as  much  communication  so 
long  as  the  problem  can  make  use  of  identical  messages  transmitted  in 
all  four  directions. 

In  addition  to  this  simple  local  communication,  communication  lines 
from  a central  source  to  the  10,000  processors  are  provided.  One  hundred 
individual  row  wires  and  100  individual  column  wires  are  driven  by  the 
central  control  system  which  enable  it  to  designate  any  individual  pro- 
cessor for  special  attention.  The  row  and  column  wires  account  for  2 
additional  pins  on  the  package.  The  remaining  9 wires  on  the  integrated- 
circuit  package  are  used  to  obtain  power  and  ground  and  to  communicate 
7 command  signals  with  the  central  machine.  These  7 command  signals  go 
in  parallel  into  all  10,000  processors  and  instruct  them  as  to  what  step 
to  do  next. 
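
The pin assignments described above can be tallied against the 16-pin package. In the sketch below, the 5 neighbor pins, 2 row and column pins, and 7 command pins come from the text; the figure of 4 pins for the single-wire scheme (one wire per neighbor) and the split of the remaining wires into one power and one ground pin are inferences, not stated values.

```python
PACKAGE_PINS = 16
ROW_COLUMN_PINS = 2
POWER_GROUND_PINS = 2     # inferred: 9 remaining wires less 7 command signals
COMMAND_PINS = 7

def pins_used(neighbor_pins: int) -> int:
    """Total pins needed for a given number of neighbor-communication pins."""
    return neighbor_pins + ROW_COLUMN_PINS + POWER_GROUND_PINS + COMMAND_PINS

schemes = {
    "single-wire scheme (Fig. 1)": 4,        # inferred: one wire per neighbor
    "identical-message scheme (Fig. 2)": 5,  # stated in the text
}
for name, neighbor_pins in schemes.items():
    used = pins_used(neighbor_pins)
    print(name, used, PACKAGE_PINS - used)   # name, pins used, pins spare
```

On this tally the identical-message scheme uses all 16 pins, which matches the count of 9 remaining wires given in the text, while the single-wire scheme would leave one pin free.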

Notice  that  the  straw-man  machine  design  does  not  have  long  com- 
munication paths  within  the  array.  It  is  not  possible  for  an  element  to 
talk  to  another  element  8 or  10  positions  away  in  the  array  without 
transmitting  the  message  through  intermediate  computing  elements.  This 
feature  of  the  design  is  important  because  it  minimizes  the  number  of 
wires  required  to  interconnect  with  the  processing  elements  and,  thus, 
minimizes  the  cost  of  interconnection.  The  communication  costs  between 
the  processing  elements  could  easily  grow  to  be  a major  factor  were 
long-distance  communication  made  available. 

Three-dimensional  problems  are  handled  in  the  straw-man  machine  by 
using  the  storage  available  in  any  processor  to  represent  a set  of  data
elements  from  sample  points  that  lie  in  a column  in  the  volume  of  space
in  which  computation  takes  place.  The  arithmetic  units  in  the  proces- 
sing elements  will  process  various  planes  of  the  solution  in  sequence 
using  data  from  different  places  in  the  column.  Thus,  each  processor, 
in  effect,  marches  up  and  down  its  column,  doing  its  computations  at 
the  same  time  its  neighbors  are  marching  up  and  down  their  columns 
performing  similar  computations  there.  In  part,  the  size  of  the  prob- 
lem which  can  be  attacked  is  limited  by  the  number  of  storage  elements 
available  and,  thus,  the  number  of  elements  in  the  column  which  can  be 
adequately  represented. 
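
The column arrangement can be sketched in a few lines. In this illustration (an assumption-laden model, not the straw-man's actual program) u[i][j] stands for the column of values stored in processor (i, j), and a simple seven-point average stands in for whatever difference formula is used. Values at the same height come from the four neighboring processors; values above and below come from the processor's own storage.

```python
def sweep_planes(u, nx, ny, nz):
    """Apply a 7-point nearest-neighbor average, one horizontal plane at a time.

    u[i][j] is the column of nz values held by processor (i, j).  For each
    plane k, every processor combines the values of its four array neighbors
    at the same height k with its own stored values at k - 1 and k + 1.
    Edge columns and the top and bottom planes are left unchanged here.
    """
    new = [[col[:] for col in row] for row in u]
    for k in range(1, nz - 1):                 # march along the column
        for i in range(1, nx - 1):             # in hardware, all (i, j) run at once
            for j in range(1, ny - 1):
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k] +   # neighbor processors
                                u[i][j - 1][k] + u[i][j + 1][k] +
                                u[i][j][k - 1] + u[i][j][k + 1]     # local storage
                                ) / 6.0
    return new

nx, ny, nz = 4, 4, 5
u = [[[float(k) for k in range(nz)] for _ in range(ny)] for _ in range(nx)]
v = sweep_planes(u, nx, ny, nz)
print(v[1][1][2])   # 2.0: a field linear in k is unchanged by the average
```

The two inner loops over i and j are what the array performs simultaneously; only the loop over k is sequential, so the time per sweep grows with the column length rather than with the total number of grid points.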

Input/output  to  the  array  can  be  accomplished  through  appropriate 
use  of  row  and  column  wires.  Individual  messages  to  individual  proces- 
sors may  also  be  passed  into  the  array  by  designating  the  recipient 
with  the  row  and  column  wires  and  sending  the  appropriate  message  along 
the  many  available  broadcast  wires.  All  processors  will  receive  the 
message,  but  only  the  designated  one  will  act  on  it. 
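
The addressing scheme amounts to selecting the one processor at the intersection of an active row wire and an active column wire. The following sketch models that selection in software; the function name and message text are hypothetical, and the real hardware would of course perform the comparison in every processor at once.

```python
def broadcast(message, row, col, n=100):
    """Return the processors that act on a broadcast message.

    Every processor sees the message, but only the one whose row wire and
    column wire are both asserted acts on it.
    """
    row_wires = [r == row for r in range(n)]   # one row wire asserted
    col_wires = [c == col for c in range(n)]   # one column wire asserted
    acting = [(r, c) for r in range(n) for c in range(n)
              if row_wires[r] and col_wires[c]]
    return {pos: message for pos in acting}

print(broadcast("load boundary value", row=3, col=97))
```

With this arrangement 200 select wires are enough to single out any one of the 10,000 processors.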

Output  from  the  array  processors  might  be  greatly  facilitated  by 
including  a light  emitting  diode  on  each  of  the  integrated  circuit 
packages.  Thus,  each  of  the  10,000  integrated  circuits  could  light  up 
under  suitable  conditions.  For  example,  one  could  have  any  processor 
where  the  pressure  exceeded  a certain  threshold  light  up  and,  thus, 
watch  the  play  of  high-pressure  regions  as  the  problem  solution  proceeds. 
Since  the  output  of  many  computations  suitable  to  array  processing  is 
appropriately  represented  in  pictorial  form,  the  notion  of  generating 
the  pictures  in  situ  and  using  optical  communication  photography  as 
the  direct  output  medium  makes  good  sense. 

The  challenge  in  using  this  straw  man  to  solve  problems  is  to 
couch  the  problem  in  terms  which  do  not  require  long-distance  communi- 
cation since  long-distance  communication  is  fundamentally  expensive  from 
the  computer  architecture  point  of  view.  One  hopes  that  the  solution 
technique  for  solving  the  Navier-Stokes  problem,  by  its  very  nature,  can 
easily  be  couched  in  such  terms  that  long-distance  communication  is  not 
necessary.  The  flexibility  of  communication  available  in  a random-access
memory  has  lowered  the  sensitivity  of  most  computer  people  to  the  very 
real  cost  of  communication.  Our  major  exercise  during  this  conference  is 
to  examine  the  match  between  the  chosen  problem  area,  the  Navier-Stokes
equations,  and  the  hypothetical  machine  architecture,  considering 
principally  the  communication  limitations  which  will  exist.  The  ex- 
tent to  which  we  are  successful  in  mapping  the  geometry  of  the  problem 
area  onto  the  geometry  of  the  wiring  provided  in  the  computer  is  the 
extent  to  which  we  will  be  successful  in  obtaining  the  computational 
gains  promised  by  the  "army  of  ants"  architecture.  Should  we  fail  to 
map  the  problem  domain  onto  the  specific  configuration  of  the  machine, 
we  will  fail  to  reap  its  benefits. 

The  potential  benefits  of  the  architecture  described  are  enormous.
Although  each  processing  element  is  capable  of  performing  a 64-bit
fixed-point  addition  in  100  nanoseconds  and,  thus,  will  require  6.4
microseconds  to  perform  a multiplication,  the  array  of  10,000  such
elements  performs  at  a rate  of  approximately  1.5  x 10^9  multiplications
per  second,  a very  impressive  computational  power  indeed.  If  we  assume 
for  the  moment  that  these  simple  integrated  circuits  can  be  built  for 
$10  per  chip,  we  are  talking  about  a spectacular  cost/performance  ratio 
and  herein  lies  the  appeal  of  the  architecture.  The  challenge,  of 
course,  is  to  ge : the  individual  computing  elements  to  work  success- 
fully together. 
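
The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that a multiplication costs one addition per bit of a 64-bit word (which is how 6.4 microseconds follows from 100 nanoseconds) and that every element is busy all the time; the function names are illustrative only.

```python
# Back-of-envelope check of the straw-man throughput figures.
WORD_BITS = 64          # fixed-point word length (from the text)
ADD_TIME_S = 100e-9     # one 64-bit addition (from the text)
N_ELEMENTS = 100 * 100  # 100 x 100 array (from the text)
CHIP_COST_USD = 10.0    # assumed price per chip (from the text)


def multiply_time_s(word_bits=WORD_BITS, add_time_s=ADD_TIME_S):
    """Shift-and-add multiply: assume one addition per multiplier bit."""
    return word_bits * add_time_s


def array_multiply_rate(n_elements=N_ELEMENTS):
    """Aggregate multiplications per second with all elements busy."""
    return n_elements / multiply_time_s()


if __name__ == "__main__":
    print(f"multiply time : {multiply_time_s() * 1e6:.1f} microseconds")
    print(f"array rate    : {array_multiply_rate():.2e} multiplications/s")
    print(f"chip cost     : ${N_ELEMENTS * CHIP_COST_USD:,.0f}")
```

The aggregate rate is an upper bound: it ignores communication and any time during which elements sit idle.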

THE  STRAW  MAN  REVISITED 

During the two-day conference, our discussions of problem areas
and technological feasibility led me to a modification of the straw-man
architecture. I am convinced, first of all, that the amount of storage
available at each node must be maximized, since storage is the major
limitation on the size of the problem which can be attacked. Each
processing element, I believe, should have as many bits of memory as
possible, even if those bits of storage are somewhat inconvenient to
use. For example, one might use a serial storage device such as a
charge-coupled device (CCD) memory. In addition to a large serial store
of perhaps one or two hundred thousand bits, each processing element
should have random-access memory of perhaps as few as 32 64-bit words.
The arithmetic capability of a single adder per processing element seems
appropriate to our present purpose, since incorporating additional
arithmetic elements would increase the design cost substantially and not
change the performance characteristics very much.

Each processor should include some space for storing algorithms
and data. This feature is important because it permits individual
processors to be distinguished from each other not only in their position
in the array and the data which they contain, but also in terms of the
specific functions they perform. These distinctions between processors
will give the system an enormous flexibility, flexibility well worth
the cost in storage and logical complexity required to implement the
functions. If each machine is, in effect, a stored-program
microcomputer, the variety of functions available to the array is greatly
increased, since the processors need not function identically.

II. A VARIANT ON IVAN SUTHERLAND'S PROPOSAL
FOR A NAVIER-STOKES COMPUTER
R.  Stockton  Gaines 
The  Rand  Corporation 


I would like to suggest a variation on the proposal made by Ivan
Sutherland. I would then argue that, on the basis of this suggestion
and some of the ideas presented in the talks at the workshop, one can
accept the hypothesis that a special-purpose computing machine with
a computing capacity at least an order of magnitude superior to current
supercomputers can be built at a cost substantially less than current
supercomputers, and that such a machine can be considered in the
next 3 to 5 years.

The Sutherland design proposed a 100 by 100 array of small processors
manufactured on individual chips. The power of each of these
10,000 processors would essentially be that of a current 16 K memory
chip  with  a little  bit  of  additional  logic  to  allow  an  adder  and  a 
one-stage  shifter  to  be  incorporated  on  the  chip.  Ivan  proposed  that 
such  a chip  could  be  produced  within  the  next  few  years  at  a cost  of 
about  $10  a chip,  which  seems  to  be  a reasonable  prediction  given  the 
current  state  of  integrated-circuit  technology.  This  array  would  be 
controlled  in  a broadcast  mode  so  that  all  processors  in  the  array  would 
either  carry  out  no  activity  or  the  same  activity  at  any  individual  time 
step  in  the  process  of  calculating  a result.  Each  chip  would  have  the 
ability  to  communicate  the  contents  of  any  portion  of  its  memory  to  its 
adjacent  neighbors  in  the  array. 

I wish to suggest that there should be substantially more computing
power in each node of the array. The presentation by INTEL at the
workshop offered the hypothesis that within five years INTEL will be able
to manufacture a single chip with computing power approximately
equivalent to a PDP-11/70 for a cost of about $100 per chip. On the basis
of the details of the INTEL presentation and other discussions during the
workshop, I believe it is reasonable to suggest that the processing power
at each node of the array be substantially greater than the rather
simple computing capacity of the Sutherland proposal. In particular,
I suggest that we consider that the computing power at each node of
the array be equivalent to a small computer that can: (a) perform
the small set of operations needed to compute finite-difference
approximations, (b) store the amount of information necessary for
computing tasks at one point in the array, and (c) communicate with its
neighbors as well as in broadcast mode to the entire array or to the
controller at the edge of the array. I have been informed that it would
be reasonable to put together a small processor on a board consisting of
a few chips made out of current technology which might have this power
already for a cost of approximately $500 in today's market. Such a
processor would have one chip for communications, one chip for
processing, and several chips for memory. The amount of memory that would
be necessary would be enough to store several variables for each point in
the third dimension of a three-dimensional space. The amount of memory to
be stored on each chip would be about the same as in the Sutherland
proposal; the main difference being proposed here is the amount of
computing power that should be available at each point in the array.

There should be storage at each point in the array for a small
number of program steps, say 100, which would be roughly equivalent to
the inner DO-loop of a finite-difference approximation program. Since
the computer at each point would not need to store many instructions
or many data (compared with general-purpose computers), the space
necessary to store a program would be quite small. The advantage of
storing the program in each chip is that different chips in the array
can be doing different activities at the same time. An actual problem
involves computations on the fluid for which the finite-difference
approximation holds, while at the same time carrying out special
computations at all boundaries (external and internal). Under external
control, therefore, for the greater portion of the computing time most
of the elements of the array will be shut off while special boundary
conditions are being computed in part of the array. The effective
computing capacity of the array as a whole could be reduced by a factor
of 6 or more for many problems because of this.
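
The loss described above can be illustrated with a toy cost model. In the sketch below the kinds of cell, their per-step costs, and their counts are hypothetical numbers chosen only for illustration; the model assumes that under broadcast control the different kinds of cell are served one after another, while with a stored program in each cell they all proceed at once.

```python
# Toy model: broadcast (external) control versus a stored program per cell.
# All costs and counts below are hypothetical illustration values.
COSTS = {"interior": 100, "wall": 150, "inflow": 120,
         "outflow": 120, "free_stream": 110}          # operations per step
COUNTS = {"interior": 9604, "wall": 100, "inflow": 98,
          "outflow": 98, "free_stream": 100}          # cells of each kind


def step_time_broadcast(kind_costs):
    """One instruction stream: each kind is served in turn while the
    other kinds idle, so the step time is the sum of the costs."""
    return sum(kind_costs.values())


def step_time_stored_program(kind_costs):
    """Every cell runs its own program, so the slowest kind sets the time."""
    return max(kind_costs.values())


def utilization(kind_costs, kind_counts, step_time):
    """Fraction of the available cell-time spent on useful work."""
    useful = sum(kind_costs[k] * kind_counts[k] for k in kind_costs)
    return useful / (step_time * sum(kind_counts.values()))


if __name__ == "__main__":
    tb = step_time_broadcast(COSTS)
    ts = step_time_stored_program(COSTS)
    print("slowdown under broadcast control:", tb / ts)
    print("utilization, broadcast     :", utilization(COSTS, COUNTS, tb))
    print("utilization, stored program:", utilization(COSTS, COUNTS, ts))
```

With these made-up numbers the slowdown is a factor of 4; more kinds of boundary, or costlier boundary computations, push it higher.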

It seems likely that a special-purpose computer unit, designed to
hold a small program and carefully tailored to performing just those
computational activities appropriate to finite-difference approximations,
would not be too difficult to design given today's technology, and
could be produced in quantity at costs comparable to those we were
talking about for other chips.

The  presentation  by  INTEL  lends  credence  to  the  likelihood  that 
powerful  computing  chips  will  be  available  at  reasonable  costs.  In 
addition, I have discovered that there is already a chip available for
about  $100  that  contains  a NOVA  computer  on  it.  Other  discussions 
during  the  workshop  also  suggest  that  chips  such  as  those  discussed  are 
feasible  to  produce. 

AGENDA

NAVIER-STOKES COMPUTER WORKSHOP

The Rand Corporation
Santa Monica, California
March 9-10, 1977

Wednesday, March 9, 1977

9:00        Introduction to Rand                 D. Rice, E. C. Gritton
            Background, Problem Definition,      S. J. Lukasik
              and Summary of Computational
              Requirements

9:45-12:00  COMPUTER TECHNOLOGY (I. Sutherland)
            State-of-Art and Technology          W. Pohlman, INTEL
              Projections
            Computer Architecture and Multi-     I. Sutherland, Cal Tech
              Processor Design
            Summary of Experience with ILLIAC    D. Stevenson, NASA/IAC
            Summary of Experience with STAR      P. J. Bobbitt, NASA/Langley
            Discussion

12:00-1:00  Lunch

1:00-3:00   NUMERICAL SIMULATIONS (W. S. King)
            Numerical Algorithms and Solution    W. S. King, M. Juncosa, Rand
              Techniques
            Navier-Stokes Simulator              C. E. Grosch, Old Dominion
            Large Eddy Turbulence Simulation     J. Ferziger, Stanford

3:00-5:00   FLUID MECHANICS APPLICATIONS (C. Gazley, Jr.)
            Applications and Problem Areas       C. Gazley, Jr., Rand
            Application to Experimental          D. E. Coles, Cal Tech
              Facilities
            Computational Aerodynamic Design     F. R. Bailey, NASA/Ames
              Facility
            Discussion

5:30        Social Hour

Thursday, March 10, 1977

8:30-9:00   Introduction to Working Group Session (Main Conference Room)

9:00-12:00  Working Group Meetings:
              - Computer Technology              R. Rice
              - Application Areas                C. Gazley, Jr.
              - Numerical Simulations            W. S. King

12:00-1:00  Lunch

1:00-5:00   Report of Working Groups (E. C. Gritton, Chairman)
            Discussion and Final Recommendations

NAVIER-STOKES COMPUTER WORKSHOP

March 9-10, 1977

PARTICIPANTS

Dr. F. Ronald Bailey
Thermo & Gas Dynamics Division
Mail Stop 229-3
Ames Aeronautical Laboratory
Moffett Field, California 94035
(415) 965-6419

Mr. Percy J. (Bud) Bobbitt
Head, Theoretical Aerodynamics Branch
STAD
NASA Langley Research Center
Mail Stop 360
Hampton, Virginia 23665
(804) 827-1110

* Professor Donald E. Coles
Aeronautics Department, 205-50
California Institute of Technology
Pasadena, California 91125
(213) 795-6811

Dr. George L. Donohue
Defense Advanced Research Projects Agency
1400 Wilson Boulevard
Arlington, Virginia 22209
(202) 694-1903

* Professor Joel Ferziger
Department of Mechanical Engineering
Stanford University
Stanford, California 94305
(415) 497-3615

Dr. S. Fernbach
Lawrence Livermore Laboratory
P. O. Box 808
Livermore, California 94550
(415) 447-1100, X3767

Dr. R. Stockton Gaines
Information Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

* Dr. Carl Gazley, Jr.
Physical Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

* Dr. Eugene C. Gritton
Physical Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

* Professor Chester E. Grosch
Institute of Oceanography
Old Dominion University
Norfolk, Virginia 23508
(804) 489-6477

Dr. C. W. Hirt
Los Alamos Scientific Laboratory
Mail Stop 216
P. O. Box 1663
Los Alamos, New Mexico 87445
(505) 667-4156

Dr. Mario L. Juncosa
Physical Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

Dr. Michael Kascic
STAR Operations Division
Control Data Corporation
4290 Fernwood Avenue
Arden Hills, Minnesota 55112
(612) 482-2100

Professor H. Keller
Applied Mathematics Department, 101-50
California Institute of Technology
Pasadena, California 91125
(213) 795-6811

* Dr. William S. King
Physical Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

Dr. Richard Lau
Office of Naval Research
1030 East Green Street
Pasadena, California 91106
(213) 681-5264

* Dr. Stephen J. Lukasik
S. J. Lukasik, Ltd.
8400 Westpark Drive
McLean, Virginia 22101
(703) 356-4490

Dr. David D. Loendorf
Institute for Computer Applications in Science & Engineering
NASA Langley Research Center
Mail Stop 246
Hampton, Virginia 23665

Mr. Marshall Pease
Stanford Research Institute
333 Ravenswood
Menlo Park, California 94025
(415) 326-6200, X4123

* Mr. William Pohlman
INTEL
3065 Bowers
Santa Clara, California 95051
(408) 246-7501

Mr. Rex Rice
Fairchild Memory Systems
1725 Technology Drive
San Jose, California 95110
(408) 998-0123, X474

Professor Philip Saffman
Applied Mathematics Department, 101-50
California Institute of Technology
Pasadena, California 91125
(213) 795-6811

Dr. John Steinhoff
Research Department
Plant 35
Grumman Aerospace Corporation
Bethpage, New York 11714
(516) 575-0574

* Dr. David Stevenson
Institute of Advanced Computation
1095 East Duane Avenue
Sunnyvale, California 94086
(408) 735-0635

Mr. Jon Surprise
Department 470-G3
Goodyear Aerospace Corporation
Akron, Ohio 44315
(216) 794-2804

* Dr. Ivan Sutherland
Computer Science Department, 256-80
California Institute of Technology
Pasadena, California 91125
(213) 795-6811

Mr. Charles Thacker
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94301
(415) 494-4000

Dr. Harold Petersen
Physical Sciences Department
The Rand Corporation
1700 Main Street
Santa Monica, California 90406
(213) 393-0411

* Speaker.

SOLUTION OF THE NAVIER-STOKES EQUATIONS
IN A CELL COMPUTER: A TEST CASE

Chester E. Grosch
Old Dominion University


C-1. INTRODUCTION

Theoretical fluid dynamicists have looked to ever larger and
faster general-purpose computers to solve the Navier-Stokes equations
using numerical methods. Increasing size, i.e., memory, permits a
finer spatial resolution for a fixed volume of fluid and/or the
calculation of flows in larger volumes. The number of operations
(additions, multiplications, memory fetches, etc.) per time step
increases at a slightly faster rate than the number of mesh points (or
modes, for spectral methods). In fact, it is easily shown that if $N^3$
mesh points are used in a calculation, the number of operations per time
step is proportional to $N^3 \ln N$ for the best methods.

While  there  have  been  some  improvements  in  algorithms,  progress 
has  been  tied  to  the  development  of  ever  faster  general-purpose  computers. 
The  most  recent  "supercomputers"  incorporate  substantial  amounts  of 
pipelining and parallelism in their central processing units (CPUs) in
addition  to  using  faster  processing  elements.  In  short,  high-speed 
computation  has  been  sought  via  faster  and  more  complex  CPUs.  This  is 
true  even  for  ILLIAC  IV.  Each  of  the  64  processing  elements  (PEs)  is 
a quite  powerful  general-purpose  computer. 

It  is,  of  course,  true  that  any  computer  which  is  equivalent  to  a 
Turing  machine  is  a general-purpose  computer.  However,  it  is  equally 
true  that  a given  computer  architecture  will  handle  certain  classes 
of  algorithms  better  than  others.  It  appears  that  all  of  today's 
"supercomputers,"  including  ILLIAC  IV,  were  designed  to  handle  many 
different  classes  of  scientific  computing  problems.  The  cost  of  this 
versatility  appears  to  be  that  these  architectures  are  not  nearly 
optimum  for  any  one  class  of  problems. 

Dr. Sutherland has suggested an architecture which appears to be
much closer to optimum, for computational fluid dynamics, than that of
existing large computers. This architecture is a two-dimensional
array of cells. Each cell contains a simple, and rather slow, arithmetic
unit and a modest amount of memory. The potential advantages of
this architecture are threefold: first, it appears possible to build
such a computer using existing technology (each cell is only a few
chips, perhaps a single chip) at fairly modest cost, particularly if
the intercellular connections are minimized; second, technological
developments in the semiconductor industry are leading toward increasing
complexity and density per chip and lower cost per chip; and third,
if the array of cells can be mapped onto the fluid domain so that $N^2$
operations can be performed in parallel on $N^2$ chips, the number of
sequential operations per time step can be reduced from $O(N^3 \ln N)$ to
$O(N \ln N)$.
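
A short sketch makes the size of this reduction concrete. It takes the two operation-count estimates at face value, drops all constant factors, and assumes the logarithm is the natural one; the function names are illustrative.

```python
import math


def sequential_ops_single_processor(n):
    """Order-of-magnitude operations per time step for N^3 mesh points."""
    return n ** 3 * math.log(n)


def sequential_ops_cell_array(n):
    """With N^2 cells in parallel, each cell handles one line of N points."""
    return n * math.log(n)


if __name__ == "__main__":
    for n in (32, 64, 100):
        ratio = sequential_ops_single_processor(n) / sequential_ops_cell_array(n)
        print(f"N = {n:3d}: sequential work reduced by a factor of {ratio:.0f}")
```

The ratio is simply $N^2$, the number of cells, so a 100 by 100 array would ideally shorten each time step by four orders of magnitude in sequential operations.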

This  architecture  appears  to  be  quite  promising,  at  least  for  the 
finite-difference  methods  commonly  used  in  computational  fluid  dynamics. 
The  flow  field  is  represented  by  a three-dimensional  array  of  fluid 
cells  so  that  each  cell  in  the  planar  array  of  computer  cells  could 
represent  one  or  more  of  the  fluid  cells.  The  common  algorithms  to 
compute  the  flow  field  are  largely  parallel  in  that  the  operations 
performed  at  any  one  point  in  a space  to  compute  the  velocity  field  are 
the  same  operations  performed  at  nearly  all  other  points  in  the  space. 

There are, of course, certain problems. Fluid cells on the boundaries
of the computational region are inherently different from fluid
cells in the interior (hence the "nearly all" used above). The pressure
field, for incompressible flow, is global, i.e., the pressure at any
point depends on the velocity field everywhere.

Nevertheless,  this  architecture  appears  so  promising  that  it  is 
worthwhile to examine a test case. The test case considers the
incompressible flow of a fluid in a boundary layer adjacent to a rigid,
impermeable,  no-slip  wall. 

The  objectives  of  this  exercise  are  to  determine: 

1.  How  well  a standard  algorithm  fits  a cell  computer. 

2.  What  modifications  to  the  algorithm  are  required. 

3.  Which  part,  if  any,  of  the  algorithm  dominates  the  calculation. 

4.  Which operation (transfer, addition, etc.) dominates the
calculation.

5.  How  many  operations  are  required  per  time  step. 

6.  What  is  the  memory  requirement  per  cell. 

7.  What is the computation time per time step, using conservative
estimates of the transfer, multiply, and add times.

C-2. EQUATIONS OF MOTION AND BOUNDARY CONDITIONS

The geometry of the flow to be calculated is shown in Fig. 3.
The computational region is $0 \le x \le x_0$, $0 \le y \le \infty$,
$0 \le z \le z_0$.

Fig. 3 -- Flow geometry

The equations of motion are, of course, the Navier-Stokes equations for
an incompressible fluid

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \frac{1}{R} \nabla^2 \mathbf{u} \tag{1}$$

$$\nabla \cdot \mathbf{u} = 0 \tag{2}$$

Here $R = U_0 \delta / \nu$ is the Reynolds number, where $U_0$ is a
characteristic free-stream speed, $\delta$ is the boundary-layer
thickness, and $\nu$ is the kinematic viscosity. The velocity

$$\mathbf{u} = \mathbf{i}u + \mathbf{j}v + \mathbf{k}w \tag{3}$$

has components $(u, v, w)$ in the x, y, and z directions ($\mathbf{i}$,
$\mathbf{j}$, and $\mathbf{k}$ are unit vectors), and p is the pressure
per unit density.

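A Poisson equation for the pressure can be obtained by taking the
divergence of Eq. (1)

$$\nabla^2 p = Q \tag{4}$$

with

$$Q = E - S \tag{5}$$

$$E = \frac{1}{R} \nabla^2 D - \frac{\partial D}{\partial t} \tag{6}$$

$$S = \nabla \cdot [(\mathbf{u} \cdot \nabla)\mathbf{u}] \tag{7}$$

$$D = \nabla \cdot \mathbf{u} \tag{8}$$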

The divergence, D, must be zero from Eq. (2), but E is retained on
the right-hand side of the Poisson equation in order to correct, at each
time step, for roundoff and truncation errors which cause $D \ne 0$. This
is not an artificial viscosity term because the object is to force D, and
hence E, to be zero at each time step. In a sense the Poisson equation
with E plays the role of the corrector in a predictor/corrector scheme.
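
The step from the momentum equation to the pressure equation can be written out explicitly. This is a sketch that assumes R is constant and that the divergence commutes with the time derivative and with the Laplacian:

```latex
\nabla \cdot \left[ \frac{\partial \mathbf{u}}{\partial t}
      + (\mathbf{u} \cdot \nabla)\mathbf{u} \right]
  = \nabla \cdot \left[ -\nabla p + \frac{1}{R} \nabla^{2} \mathbf{u} \right]
\;\Longrightarrow\;
\frac{\partial D}{\partial t} + S = -\nabla^{2} p + \frac{1}{R} \nabla^{2} D
\;\Longrightarrow\;
\nabla^{2} p
  = \left( \frac{1}{R} \nabla^{2} D - \frac{\partial D}{\partial t} \right) - S
  = E - S = Q .
```

If D were exactly zero, E would vanish and the source term would reduce to $-S$; keeping E is what lets the pressure solve push a small nonzero D back toward zero.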

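The boundary conditions are:

$$
\begin{aligned}
&\mathbf{u} \text{ and } [\ldots] \text{ given} && \text{at } x = 0 && (9) \\
&\mathbf{u} = 0 && \text{at } y = 0 && (10\text{a}) \\
&\frac{\partial p}{\partial y} = \frac{1}{R} \frac{\partial^2 v}{\partial y^2} && \text{at } y = 0 && (10\text{b}) \\
&u \to U(x,z) && \text{as } y \to \infty && (11\text{a}) \\
&v \to V(x,z) && \text{as } y \to \infty && (11\text{b}) \\
&w \to 0 && \text{as } y \to \infty && (11\text{c}) \\
&\frac{\partial p}{\partial y} \to 0 && \text{as } y \to \infty && (11\text{d}) \\
&\mathbf{u} \text{ and } p \text{ periodic in } z && && (12) \\
&[\ldots] && \text{at } x = x_0 && (13\text{a}),\ (13\text{b})
\end{aligned}
$$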

Here U and V are the free-stream velocity components and are O(1)
because of the scaling with $U_0$. The periodic boundary conditions in
z are a convenience, not a necessity. Either flow-through, free-slip
conditions or solid-wall conditions could have been used for the z
boundary conditions. In either case the final results (number of
operations per time step) would have been changed by less than ten
percent.

The "correct" outflow boundary conditions are not known, but it is
known that "reasonable" outflow conditions, such as that used here at
$x = x_0$, cause a boundary layer to form at the rear of the computational
region. Its thickness, in the upstream direction, is $O(\delta)$.

In order to represent the infinite physical domain $0 \le y < \infty$ in
the finite computational domain a mapping is used

$$\zeta = y/(y + \ell) \tag{14}$$

The value chosen for the scale factor, $\ell$, determines the resolution in
the boundary layer, because (in the scaled variables) y = 1 is the top
of the boundary layer which, in the mapped variable, is at

$$\zeta = 1/(1 + \ell)$$

This maps the infinite domain ($0 \le y < \infty$) onto the finite domain
($0 \le \zeta < 1$). With this mapping the derivatives have a simple form

$$\frac{\partial}{\partial y} = \frac{(1 - \zeta)^2}{\ell} \frac{\partial}{\partial \zeta} \tag{15a}$$

$$\frac{\partial^2}{\partial y^2} = \frac{(1 - \zeta)^2}{\ell^2} \left[ -2(1 - \zeta) \frac{\partial}{\partial \zeta} + (1 - \zeta)^2 \frac{\partial^2}{\partial \zeta^2} \right] \tag{15b}$$

It has been shown [Ref. 15] that this mapping yields highly accurate
results with relatively few grid points in those cases, as in this
problem, where the flow field at infinity is a simple laminar flow. The
only cost is that the metric coefficients must be stored in each cell.
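
The effect of the scale factor on resolution can be seen in a short sketch. It assumes the mapping is the algebraic one, zeta = y / (y + l), and that the grid is uniform in the mapped variable; the function names and the sample values of the scale factor are illustrative only.

```python
def zeta_of_y(y, ell):
    """Map 0 <= y < infinity onto 0 <= zeta < 1 (the mapping of Eq. 14)."""
    return y / (y + ell)


def y_of_zeta(zeta, ell):
    """Inverse mapping, valid for 0 <= zeta < 1."""
    return ell * zeta / (1.0 - zeta)


def metric(zeta, ell):
    """d(zeta)/dy = (1 - zeta)^2 / ell, the factor in the first derivative."""
    return (1.0 - zeta) ** 2 / ell


def fraction_of_points_in_layer(ell):
    """Share of a uniform zeta grid that lies at y <= 1, inside the layer."""
    return zeta_of_y(1.0, ell)


if __name__ == "__main__":
    for ell in (0.5, 1.0, 2.0):   # hypothetical scale factors
        share = fraction_of_points_in_layer(ell)
        print(f"ell = {ell}: {share:.0%} of the grid points lie in the layer")
```

A smaller scale factor concentrates more of the grid inside the boundary layer at the expense of the outer flow, which is the trade-off the choice of scale factor controls.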


C-3.  DIFFERENCING  METHODS 

The physical space, $0 \le x \le x_0$, $0 \le \zeta \le 1$,
$0 \le z \le z_0$, is divided into L · M · N cells centered on the points


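The spatial differencing scheme is of the centered second-order
type. As an example, consider the u-component of Eq. (1) written in
conservation form

$$
\frac{\partial u}{\partial t} + \frac{\partial u^2}{\partial x}
+ \frac{(1 - \zeta)^2}{\ell} \frac{\partial uv}{\partial \zeta}
+ \frac{\partial uw}{\partial z} + \frac{\partial p}{\partial x}
= \frac{1}{R} \left\{ \frac{\partial^2 u}{\partial x^2}
+ \frac{(1 - \zeta)^2}{\ell^2} \left[ (1 - \zeta)^2 \frac{\partial^2 u}{\partial \zeta^2}
- 2(1 - \zeta) \frac{\partial u}{\partial \zeta} \right]
+ \frac{\partial^2 u}{\partial z^2} \right\}
\tag{18}
$$

The semi-discrete approximation for $u_{i-\frac12,j,k}$ is

$$
\begin{aligned}
\frac{\partial}{\partial t} u_{i-\frac12,j,k} = {}
 & -\frac{1}{\Delta x} \Big\{ \big( u^2_{i,j,k} - u^2_{i-1,j,k} \big) \\
 & \quad + a_j \big( u_{i-\frac12,j+\frac12,k}\, v_{i-\frac12,j+\frac12,k}
      - u_{i-\frac12,j-\frac12,k}\, v_{i-\frac12,j-\frac12,k} \big) \\
 & \quad + \frac{\Delta x}{\Delta z} \big( u_{i-\frac12,j,k+\frac12}\, w_{i-\frac12,j,k+\frac12}
      - u_{i-\frac12,j,k-\frac12}\, w_{i-\frac12,j,k-\frac12} \big) \\
 & \quad + \big( p_{i,j,k} - p_{i-1,j,k} \big) \Big\} \\
 & + \frac{1}{(\Delta x)^2 R} \Big[ u_{i+\frac12,j,k} - 2u_{i-\frac12,j,k} + u_{i-\frac32,j,k} \\
 & \quad + b_j \big( u_{i-\frac12,j+1,k} - 2u_{i-\frac12,j,k} + u_{i-\frac12,j-1,k} \big) \\
 & \quad - c_j \big( u_{i-\frac12,j+1,k} - u_{i-\frac12,j-1,k} \big) \\
 & \quad + \Big( \frac{\Delta x}{\Delta z} \Big)^2 \big( u_{i-\frac12,j,k+1} - 2u_{i-\frac12,j,k} + u_{i-\frac12,j,k-1} \big) \Big]
\end{aligned}
\tag{19a}
$$

where

$$a_j = \left( \frac{\Delta x}{\Delta \zeta} \right) \left[ \frac{(1 - \zeta)^2}{\ell} \right]_j \tag{19b}$$

$$b_j = \left( \frac{\Delta x}{\Delta \zeta} \right)^2 \left[ \frac{(1 - \zeta)^4}{\ell^2} \right]_j \tag{19c}$$

$$c_j = \left( \frac{\Delta x}{\Delta \zeta} \right)^2 \left[ \frac{(1 - \zeta)^3 \, \Delta \zeta}{\ell^2} \right]_j \tag{19d}$$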

and the subscript j on $a_j$, $b_j$, and $c_j$ means that these functions
are evaluated at $\zeta = j \cdot \Delta\zeta$. Note that certain of these
variables, such as $u_{i,j,k}$, are not directly available. They are
defined as the average of the variable at adjacent grid points,

$$u_{i,j,k} = \tfrac{1}{2} \big( u_{i+\frac12,j,k} + u_{i-\frac12,j,k} \big) \tag{20}$$

It can be shown that this approximation, when applied to Eqs. (1) through
(8) with boundary conditions (9) through (13), is conservative in the
sense that mass is conserved for any R; and, in the limit $R \to \infty$,
momentum, energy, and enstrophy are also conserved.

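The time differencing is of the Adams-Bashforth type. Let us define

$$
F^{n}_{i-\frac12,j,k} = \left[ -\frac{\partial u^2}{\partial x}
- \frac{(1 - \zeta)^2}{\ell} \frac{\partial uv}{\partial \zeta}
- \frac{\partial uw}{\partial z} - \frac{\partial p}{\partial x}
+ \frac{1}{R} \left\{ \frac{\partial^2 u}{\partial x^2}
+ \frac{(1 - \zeta)^2}{\ell^2} \left[ (1 - \zeta)^2 \frac{\partial^2 u}{\partial \zeta^2}
- 2(1 - \zeta) \frac{\partial u}{\partial \zeta} \right]
+ \frac{\partial^2 u}{\partial z^2} \right\} \right]^{n}_{i-\frac12,j,k}
\tag{21}
$$

the right-hand side of (19a) defined at (subscripts) the point
$(i\Delta x, j\Delta\zeta, k\Delta z)$ in space and at (superscript) time
$n\Delta t$. $F^{n}_{i,j-\frac12,k}$ and $F^{n}_{i,j,k-\frac12}$ are defined
in a completely analogous way. Then the Adams-Bashforth approximation to
the time derivative is

$$u^{n+1}_{i-\frac12,j,k} = u^{n}_{i-\frac12,j,k} + \frac{\Delta t}{2} \left( 3F^{n}_{i-\frac12,j,k} - F^{n-1}_{i-\frac12,j,k} \right) \tag{22a}$$

$$v^{n+1}_{i,j-\frac12,k} = v^{n}_{i,j-\frac12,k} + \frac{\Delta t}{2} \left( 3F^{n}_{i,j-\frac12,k} - F^{n-1}_{i,j-\frac12,k} \right) \tag{22b}$$

$$w^{n+1}_{i,j,k-\frac12} = w^{n}_{i,j,k-\frac12} + \frac{\Delta t}{2} \left( 3F^{n}_{i,j,k-\frac12} - F^{n-1}_{i,j,k-\frac12} \right) \tag{22c}$$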
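
The time-stepping rule is easy to exercise on a scalar model problem. The sketch below applies the second-order Adams-Bashforth update to du/dt = -u; the forward Euler starting step is an assumption made here, since the text does not say how the first step is taken.

```python
import math


def adams_bashforth_step(u_n, f_n, f_nm1, dt):
    """u^{n+1} = u^n + (dt/2) * (3 F^n - F^{n-1})."""
    return u_n + 0.5 * dt * (3.0 * f_n - f_nm1)


def integrate(rhs, u0, dt, n_steps):
    """Advance du/dt = rhs(u) by n_steps steps of size dt. The first step
    uses forward Euler because F^{n-1} is not yet available (an assumption)."""
    f_prev = rhs(u0)
    u = u0 + dt * f_prev
    for _ in range(n_steps - 1):
        f_now = rhs(u)
        u, f_prev = adams_bashforth_step(u, f_now, f_prev, dt), f_now
    return u


approx = integrate(lambda u: -u, 1.0, 0.01, 100)   # integrate to t = 1

if __name__ == "__main__":
    print("Adams-Bashforth:", approx)
    print("exact          :", math.exp(-1.0))
```

Only one earlier value of F needs to be kept for each velocity component, which matters when each cell has little memory.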

C-4. SCHEMATIC DESCRIPTION OF THE CELL COMPUTER AND THE DATA STRUCTURE

A schematic  diagram  of  the  cell  computer  is  shown  in  Fig.  5.  There 
is a two-dimensional array of L x M cells. Each cell connects (can
communicate directly in the sense of interchanging data) with its four nearest
neighbors  "above"  and  "below"  and  to  the  "left"  and  "right."  In  order 
for  cell  (i,j)  to  communicate  with  cell  (i-l,j+l),  for  example,  the  data 
must  be  transferred  in  a two-step  operation;  from  (i,j)  to  (i,j+l)  or 
(i-l,j)  in  step  one  and  then  to  (i-l,j+l)  in  step  two.  It  is  clear,  assuming 
that  communication  is  slow,  that  local  communication  is  relatively  cheap 
and  long-range  communication  may  be  prohibitively  expensive.  It  is 
assumed  that  the  left  and  right  ends  and  the  top  and  bottom  of  this 
cellular  array  are  connected  so  that  the  array  is  equivalent  to  a torus. 
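
The cost of reaching a given cell can be counted directly. The sketch below assumes each transfer moves data to one of the four connected neighbors and that the wraparound connections may be used; the function names are illustrative.

```python
def torus_steps(src, dst, n_rows, n_cols):
    """Nearest-neighbor transfer steps between two cells of an
    n_rows x n_cols array whose opposite edges are joined (a torus)."""
    di = abs(src[0] - dst[0])
    dj = abs(src[1] - dst[1])
    return min(di, n_rows - di) + min(dj, n_cols - dj)


def neighbors(cell, n_rows, n_cols):
    """The four directly connected cells, with wraparound at the edges."""
    i, j = cell
    return [((i - 1) % n_rows, j), ((i + 1) % n_rows, j),
            (i, (j - 1) % n_cols), (i, (j + 1) % n_cols)]


if __name__ == "__main__":
    print(torus_steps((5, 5), (4, 6), 100, 100))    # diagonal neighbor
    print(torus_steps((0, 0), (99, 0), 100, 100))   # across the wrapped edge
    print(torus_steps((0, 0), (50, 50), 100, 100))  # farthest cell
```

On a 100 by 100 torus no cell is more than 100 steps away, but an algorithm that routinely needs such transfers gives up most of the benefit of the array.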

A single cell is assumed to contain some memory, an adder, and some
registers. The data, $v_{i,j-\frac12,k}$, etc., are stored in memory. The
registers are used as working memory to store intermediate results, and
the adder is used to perform binary addition and multiplication by shift
and add. Provided that we precompute and store 1/R, $1/\Delta x$, etc., we
never need to divide. No function evaluations are required.
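
Multiplication by shift and add can be sketched in a few lines. This version handles non-negative integers only and does not truncate the product to the word length, both simplifications made here for clarity; a real cell would also have to handle sign and fixed-point scaling.

```python
def shift_add_multiply(a, b, word_bits=64):
    """Multiply two non-negative integers using only shifts and additions.
    Only the low word_bits bits of b are examined.
    Returns (product, number_of_additions_performed)."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product, additions = 0, 0
    for bit in range(word_bits):
        if (b >> bit) & 1:          # examine one multiplier bit per cycle
            product += a << bit     # add the shifted multiplicand
            additions += 1
    return product, additions


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

In the worst case every multiplier bit is set and 64 additions are needed, which is why a multiplication is budgeted at 64 addition times.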

A column is defined as all those fluid cells at constant x; a row as
all those fluid cells at constant $\zeta$; and a rod as all those fluid
cells at constant z. It will be assumed that one rod of data, the (i,j)
rod, is stored in cell (i,j), that is

$$u_{i-\frac12,j,k},\quad v_{i,j-\frac12,k},\quad w_{i,j,k-\frac12},\quad p_{i,j,k},\quad D_{i,j,k},\quad \text{etc.}$$

for k = 1, 2, 3, ..., N is stored in cell (i,j). This of course implies
that cell (i,j) has some multiple of N words of memory plus sufficient
memory for constants such as the metric coefficients, etc.

Not  shown  in  Fig.  5 are  the  control  word  and  status  word  paths. 
Instructions  are  sent,  in  parallel,  from  a central  controller  to  all  cells. 
Each  cell,  depending  on  control  bits  stored  at  initialization  of  the  cell, 
either  performs  that  instruction  or  does  nothing.  Each  cell  can  send  back 
to  the  controller  a status  word.  All  initialization,  data,  constants,  and 
control  bits  are  sent  from  the  controller  to  the  cells.  All  output  from 
the  cells  is  sent  via  the  central  controller. 
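
The control scheme can be mimicked in a few lines. In this sketch the `Cell` class, the two instructions, and the use of a Boolean as the status word are all hypothetical simplifications; the only feature taken from the text is that every cell receives the same instruction and either performs it or does nothing, according to control bits set at initialization.

```python
class Cell:
    """One cell with an accumulator and control bits fixed at initialization."""

    def __init__(self, enabled_ops):
        self.enabled_ops = set(enabled_ops)
        self.acc = 0

    def execute(self, op, operand):
        """Perform the broadcast instruction if enabled; otherwise do
        nothing. The return value stands in for the status word."""
        if op not in self.enabled_ops:
            return False
        if op == "add":
            self.acc += operand
        elif op == "clear":
            self.acc = 0
        return True


def broadcast(cells, op, operand=0):
    """The controller sends one instruction to every cell and collects
    the status words."""
    return [cell.execute(op, operand) for cell in cells]


if __name__ == "__main__":
    cells = [Cell({"add"}), Cell({"add", "clear"}), Cell(set())]
    print(broadcast(cells, "add", 5), [c.acc for c in cells])
    print(broadcast(cells, "clear"), [c.acc for c in cells])
```

Cells whose control bits disable an instruction simply wait, which is where the idle time at boundaries comes from.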


Fig. 5 -- Schematic representation of the cell-array computer

C-5. SAMPLE CALCULATIONS

In order to advance the calculation one time step, a number of
quantities must be calculated. Among these are the set of $D_{i,j,k}$,
the divergences in each fluid cell. $D_{i,j,k}$ is defined by

$$
D_{i,j,k} = \frac{1}{\Delta x} \left\{ u_{i+\frac12,j,k} - u_{i-\frac12,j,k}
+ a_j \big( v_{i,j+\frac12,k} - v_{i,j-\frac12,k} \big)
+ \frac{\Delta x}{\Delta z} \big( w_{i,j,k+\frac12} - w_{i,j,k-\frac12} \big) \right\}
\tag{23}
$$

with

$$[\ldots] \tag{24}$$

It will be assumed that the data paths within the cell are from memory
to the register bank and back, and from any two registers to the adder and
back to any register. If it is assumed that $(1/\Delta x)$, $a_j$,
$(\Delta x/\Delta z)$, $u_{i-\frac12,j,k}$, $v_{i,j-\frac12,k}$, and
$w_{i,j,k-\frac12}$ are already in register, then, for fixed k, one
in-cell transfer (1t) is needed to get $w_{i,j,k+\frac12}$, and two
cell-to-cell transfers (2T) are needed to get $u_{i+\frac12,j,k}$ and
$v_{i,j+\frac12,k}$. Once these are in the registers, three additions (3a)
are performed, followed by two multiplications (2m), two more additions
(2a), and finally one more multiplication (1m). This completes the
calculation of $D_{i,j,k}$ for one k.

This must be repeated for each cell in the rod, k = 1, 2, ..., N,
although all cells in the k'th plane are calculating in parallel.
Therefore the total number of sequential operations which are required for
calculating the $D_{i,j,k}$ for all i, j, and k is

Nt, 2NT, 5Na, and 3Nm.

In addition to the velocity components each cell must contain the constants
$(1/\Delta x)$, $a_j$, and $(\Delta x/\Delta z)$.
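
The operation count for the divergence converts to a time once the four basic times are fixed. In the sketch below the add and multiply times are the straw-man figures of 100 nanoseconds and 6.4 microseconds, while the two transfer times are hypothetical placeholders, since no values are given at this point in the text.

```python
from dataclasses import dataclass


@dataclass
class OpTimes:
    in_cell_transfer: float   # t: memory to register, seconds
    cell_transfer: float      # T: nearest-neighbor cell to cell, seconds
    add: float                # a: one addition, seconds
    multiply: float           # m: one multiplication, seconds


def divergence_time(n, times):
    """Sequential time to form D along one rod of n points, using the
    count N t + 2N T + 5N a + 3N m."""
    per_point = (1 * times.in_cell_transfer + 2 * times.cell_transfer
                 + 5 * times.add + 3 * times.multiply)
    return n * per_point


if __name__ == "__main__":
    # Transfer times here are hypothetical; add and multiply are straw-man values.
    times = OpTimes(in_cell_transfer=1e-6, cell_transfer=10e-6,
                    add=100e-9, multiply=6.4e-6)
    print(f"{divergence_time(100, times) * 1e3:.2f} ms for N = 100")
```

With these placeholder values the multiplications and the cell-to-cell transfers each account for close to half of the time, and the additions are negligible.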

In order to advance $u^{n}$ by one time step it is necessary to calculate
D, then E, then S, then p, then $F^{n}$, and finally $u^{n+1}$. The operation
count for one time step can be obtained by writing the finite difference
equations for each of these quantities, finding where the required data
are stored, counting the number of transfers needed to get these data into
the registers in cell (i,j) and the number and type of arithmetic operations
required.

It turns out that, as was expected, the entire calculation is largely
local in that most cell-to-cell transfers are between nearest neighbors.
Figure 6 shows cell (i,j) and its neighbors. All the data, and only the
data, required to calculate $F^{n}_{i-\frac12,j,k}$, $F^{n}_{i,j-\frac12,k}$,
and $F^{n}_{i,j,k-\frac12}$ are shown in this figure in the cells where they
are stored. It can be seen that twenty words of data must be transferred
to cell (i,j); of these twenty only two, $u_{i+\frac12,j-1,k}$ and
$v_{i-1,j+\frac12,k}$, require a two-step transfer.

C-6. THE POISSON PROBLEM

As  one  might  expect,  the  most  difficult  part  of  the  calculation  is  the 
solution  of  the  Poisson  equation  with  Neumann  boundary  conditions.  This 
problem  is  elliptic,  i.e.,  nonlocal.  A direct  solution,  say  by  using 
a fast  Fourier  transform  (FFT)  in  one  direction  and  a tridiagonal 
equation  solver  in  the  other,  is  prohibitive  because  of  the  transfer 
time  cost.  More  important,  these  methods  are  restricted  to  flows  with 
very  regular  geometry;  general  geometries  require  the  use  of  relaxation 
methods . 


Consider a model problem,

$$\frac{\partial^2 P}{\partial x^2} + \frac{\partial^2 P}{\partial z^2} = Q$$

in $0 \le x \le 1$, $0 \le z \le 1$; Q given and

$$\mathbf{n} \cdot \nabla P = 0$$

with $\mathbf{n}$ a unit vector normal to the boundary on x = 0, x = 1,
z = 0, and z = 1.

Successive over-relaxation (SOR) is a standard sequential method. It is
sequential because $p^{k+1}_{i-1,j}$ and $p^{k+1}_{i,j-1}$ must be known
before $p^{k+1}_{i,j}$ can be calculated (here k is the iteration
index). If it were used, the relaxation would sweep through the cell
array one cell at a time, thus losing all the advantages of parallelism
inherent in the cell array. The Jacobi method is inherently parallel
because all cells are relaxed in parallel, i.e., $p^{k+1}_{i,j}$ is
obtained for all i and j using the data from the previous iteration.
However, the Jacobi method has a linear convergence factor as compared to
the quadratic convergence factor of SOR.
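The contrast between the two sweeps can be made concrete with a short
sketch. It assumes the standard five-point differencing of the model
problem on a uniform n x n grid of spacing h = 1/n, with a mirrored ghost
value standing in for the Neumann condition. The report does not spell
out its discretization, so the stencil, the function names, the random
seed, and the choice of relaxation factor 1.5 are illustrative
assumptions, and the iteration counts printed are not expected to
reproduce Table C-1.

```python
import random

def neighbour_sum(p, i, j, n):
    """Sum of the four neighbours of cell (i, j) on an n x n grid.
    A neighbour outside the grid is replaced by the cell's own value,
    i.e. a mirrored ghost cell (assumed zero normal gradient)."""
    here = p[i][j]
    west = p[i - 1][j] if i > 0 else here
    east = p[i + 1][j] if i < n - 1 else here
    south = p[i][j - 1] if j > 0 else here
    north = p[i][j + 1] if j < n - 1 else here
    return west + east + south + north

def max_residual(p, q, h):
    """Largest absolute residual of the five-point difference equation."""
    n = len(p)
    return max(abs((neighbour_sum(p, i, j, n) - 4.0 * p[i][j]) / (h * h) - q[i][j])
               for i in range(n) for j in range(n))

def jacobi_sweep(p, q, h):
    """Every cell is updated from the previous iterate only (parallel)."""
    n = len(p)
    return [[0.25 * (neighbour_sum(p, i, j, n) - h * h * q[i][j])
             for j in range(n)] for i in range(n)]

def sor_sweep(p, q, h, omega):
    """Cells are updated in place and in order, so (i-1, j) and (i, j-1)
    already hold new values when (i, j) is relaxed (sequential)."""
    n = len(p)
    for i in range(n):
        for j in range(n):
            gs = 0.25 * (neighbour_sum(p, i, j, n) - h * h * q[i][j])
            p[i][j] += omega * (gs - p[i][j])
    return p

def iterations_to_converge(sweep, q, h, eps=1e-4, max_iter=20000):
    """Count sweeps until the largest residual drops to eps."""
    n = len(q)
    p = [[0.0] * n for _ in range(n)]
    for k in range(1, max_iter + 1):
        p = sweep(p, q, h)
        if max_residual(p, q, h) <= eps:
            return k
    raise RuntimeError("relaxation did not converge")

def random_source(n, seed=0):
    """Random Q in (-1, 1), shifted to zero mean so the Neumann problem is solvable."""
    rng = random.Random(seed)
    q = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    mean = sum(map(sum, q)) / (n * n)
    return [[v - mean for v in row] for row in q]

n = 10
h = 1.0 / n
q = random_source(n)
k_jacobi = iterations_to_converge(jacobi_sweep, q, h)
k_sor = iterations_to_converge(lambda p, q, h: sor_sweep(p, q, h, 1.5), q, h)
print("Jacobi sweeps:", k_jacobi, " SOR sweeps (omega = 1.5):", k_sor)
```

The Jacobi sweep builds a new array from the old one, so every cell could
be relaxed at once; the SOR sweep overwrites the array as it goes, which
is what forces the cell-by-cell ordering.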

The Jacobi method can be accelerated to obtain the quadratic convergence
factor of SOR while remaining parallel. The cells are assumed to be
"reordered." All those cells with i + j even are relaxed using an
over-relaxed Jacobi method. This gives $p^{k+1/2}_{i,j}$ for i + j even.
Then all those cells with i + j odd are relaxed using the values of
$p^{k+1/2}_{i+1,j}$, $p^{k+1/2}_{i-1,j}$, $p^{k+1/2}_{i,j+1}$, and
$p^{k+1/2}_{i,j-1}$. This odd/even Jacobi over-relaxation is parallel
and has the quadratic convergence factor of SOR. It is, in fact, SOR
with red-black ordering.
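A minimal sketch of the odd/even scheme follows, under the same
assumptions as any textbook five-point treatment: a uniform n x n grid,
spacing h = 1/n, and mirrored ghost values for the Neumann condition.
None of the function names, the seed, or the relaxation factors come from
the report. Each colour is relaxed by first computing all new values and
only then storing them, which imitates one parallel step of a cell array.

```python
import random

def neighbour_sum(p, i, j, n):
    """Four-neighbour sum; an off-grid neighbour is replaced by the cell's
    own value (mirrored ghost cell, assumed zero normal gradient)."""
    here = p[i][j]
    west = p[i - 1][j] if i > 0 else here
    east = p[i + 1][j] if i < n - 1 else here
    south = p[i][j - 1] if j > 0 else here
    north = p[i][j + 1] if j < n - 1 else here
    return west + east + south + north

def max_residual(p, q, h):
    """Largest absolute residual of the five-point difference equation."""
    n = len(p)
    return max(abs((neighbour_sum(p, i, j, n) - 4.0 * p[i][j]) / (h * h) - q[i][j])
               for i in range(n) for j in range(n))

def relax_colour(p, q, h, omega, parity):
    """Relax all cells with (i + j) % 2 == parity as one parallel step.
    Every new value is computed before any is stored; this is safe because
    the four neighbours of a cell always have the other parity."""
    n = len(p)
    updates = []
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 == parity:
                gs = 0.25 * (neighbour_sum(p, i, j, n) - h * h * q[i][j])
                updates.append((i, j, p[i][j] + omega * (gs - p[i][j])))
    for i, j, value in updates:
        p[i][j] = value

def red_black_sor(q, h, omega, eps=1e-4, max_iter=20000):
    """Return the solution and the number of even-plus-odd sweeps used."""
    n = len(q)
    p = [[0.0] * n for _ in range(n)]
    for k in range(1, max_iter + 1):
        relax_colour(p, q, h, omega, 0)  # cells with i + j even
        relax_colour(p, q, h, omega, 1)  # cells with i + j odd, using new even values
        if max_residual(p, q, h) <= eps:
            return p, k
    raise RuntimeError("relaxation did not converge")

def random_source(n, seed=0):
    """Random Q in (-1, 1), shifted to zero mean so the Neumann problem is solvable."""
    rng = random.Random(seed)
    q = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    mean = sum(map(sum, q)) / (n * n)
    return [[v - mean for v in row] for row in q]

n = 10
h = 1.0 / n
q = random_source(n)
for omega in (1.0, 1.5, 1.8):
    _, sweeps = red_black_sor(q, h, omega)
    print(f"omega = {omega:.1f}: {sweeps} sweeps")
```

Because no cell of one colour reads another cell of the same colour, the
order in which the cells of that colour are visited does not affect the
result, which is the property that lets a cell array relax them
simultaneously.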

As was pointed out by King, this does not cost twice as many iterations
as SOR but only one more. To see this consider the sequence:

k = 0; turn off all cells with i + j odd.
       $p^1_{i,j}$ is computed for all cells with i + j even.

k = 1; turn the (i+j) odd cells on and leave the (i+j) even cells on.
       $p^1_{i,j}$ is computed for all cells with i + j odd.
       $p^2_{i,j}$ is computed for all cells with i + j even.

In all subsequent iterations all cells are on and the odd cells are one
step ahead of the even cells. Therefore if K sequential sweeps are
required for convergence of SOR, then only K+1 parallel sweeps are
required for red-black SOR.

The results of some numerical experiments are given in Table C-1. The
$Q_{i,j}$ were chosen to be random variates in (-1,1) with zero mean.

                               Table C-1

           RESULTS OF THE RELAXATION SOLUTION TO ∇²P = Q
                     NEUMANN BOUNDARY CONDITIONS
                        (N+2) x (N+2) MESH

    Method             N      ε        ω       K

    SOR               10    10^-4     1.0     173
                                      1.1     110
                                      1.2     111
                                      1.3     105
                                      1.4      76
                                      1.5      79
                                      1.6      61
                                      1.7      55
                                      1.8      39
                                      1.9      57
                      20    10^-4     1.8     121
                      50    10^-4     1.8     489

    JACOBI            10    10^-4     1.0     263
                      20    10^-4     1.0     740
                      50    10^-4     1.0    3865

    RED-BLACK SOR     10    10^-4     1.0     166
                                      1.1     103
                                      1.2     105
                                      1.3     102
                                      1.4      77
                                      1.5      73
                                      1.6      56
                                      1.7      45
                                      1.8      33
                                      1.9      51
                      20    10^-4     1.8     104
                      50    10^-4     1.8     463

In this table ε is the error in the solution, i.e., if $r_{i,j}$ is the
residual at mesh point (i,j), convergence occurs when

$$\max_{i,j} |r_{i,j}| \le \epsilon \qquad (27)$$

The relaxation factor is ω, and K is the number of iterations required
to converge. Note that red-black SOR converges in slightly fewer
iterations than SOR.

It has been assumed that the flow is periodic in z for the test
simulation on this array computer. This suggests that p and Q be
Fourier-transformed in the z direction. It is obvious that this converts
one three-dimensional Poisson problem over L x M x N points to N
two-dimensional Poisson problems, each over L x M points. It is assumed,
for this test case, that only a fraction f, 0 < f < 1, of the Fourier
modes require relaxation. If all Fourier modes contained appreciable
amounts of energy, there would be an aliasing problem; in effect the
resolution in z would be too coarse.
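The reduction can be written out as a short worked step. This is a
sketch under stated assumptions: a three-dimensional Poisson equation
$\nabla^2 p = Q$ in (x, y, z), a z-period $L_z$, and a continuous
z-derivative (with a discrete second difference in z, $\kappa_m^2$ would
be replaced by the corresponding modified wavenumber). The symbols
$\hat p_m$, $\hat Q_m$, $\kappa_m$, and $L_z$ are notation introduced
here, not taken from the report.

```latex
p(x,y,z) = \sum_{m} \hat p_m(x,y)\, e^{i \kappa_m z}, \qquad
Q(x,y,z) = \sum_{m} \hat Q_m(x,y)\, e^{i \kappa_m z}, \qquad
\kappa_m = \frac{2 \pi m}{L_z}

\frac{\partial^2 \hat p_m}{\partial x^2}
  + \frac{\partial^2 \hat p_m}{\partial y^2}
  - \kappa_m^2\, \hat p_m = \hat Q_m
```

Each mode m satisfies its own two-dimensional equation with no coupling
to the other modes, so the modes can be relaxed one after another, and a
mode that carries no appreciable energy need not be relaxed at all.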

Two points must be made. The first is that the Fourier decomposition is
not crucial. This decomposition only affects the order in which the
calculation is done. If we Fourier-transform, the relaxation is over
all (i,j) for each of N modes; mode zero is relaxed, then mode one, then
mode two, etc. If the z boundary conditions were different so that we
could not Fourier-transform in z, we would be relaxing for all (i,j) for
k = 1, 2, ..., N and then back to k = 1, etc., until all $p_{i,j,k}$ are
determined. The only difference is the order of the operations, not the
number of operations.

The second point is that a "standard" relaxation method (modified
slightly for a cell computer) has been assumed. We have not yet examined
other methods such as the cell-by-cell divergence-pressure method of
Hirt and Cook or even the use of block or multigrid relaxation methods
possibly to increase the convergence rate.

C-7. OPERATION COUNT FOR ONE TIME STEP

The number of in-cell transfers, cell-to-cell transfers, additions,
and multiplications necessary for each phase of the calculation to
advance u by one time step are given in Table C-2, and the totals for one
time step are summarized in Table C-3. In these tables, N is the number
of mesh points in the transverse flow direction (z direction), and the
product fK is the number of iterations for relaxation of the Poisson
equation.

From an examination of these tables, assuming that fK = 50 and N = 64,
say, several facts become apparent. First (Table C-2), operations (1)
through (9), which constitute 75 percent to 80 percent of the total
transfers, additions, and multiplications, are required to calculate the
pressure field. The calculation of the velocity field, operations (10),
(11), and (12), constitutes only about 20 to 25 percent of the
operations. Second, about 60 percent of the transfers but only 20
percent of the additions and 10 percent of the multiplications are
required to apply the boundary conditions. Third, if we take the in-cell
transfer count as unity, the cell-to-cell transfer count is about 7, the
addition count about 14, and the multiplication count about 5. Because
it is generally true that the in-cell transfer time and the addition time
are considerably smaller than the cell-to-cell transfer time and the
multiplication time, it is clear that the total calculation time is
controlled by the cell-to-cell transfer and multiplication counts and
times. Fourth, the ratio of the multiplication count to the cell-to-cell
transfer count is about one, so reducing either cell-to-cell transfer
time or multiplication time to zero would result in a speedup of only a
factor of two.
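The totals can be evaluated directly for the case quoted, N = 64 and
fK = 50. The sketch below encodes the Table C-3 expressions; the
coefficients of the log2(N) terms (2 for additions, 4 for
multiplications) are read from a poorly reproduced table and should be
treated as assumptions, as should the function and key names.

```python
import math

def operation_counts(N, fK):
    """Sequential operation totals for one time step, per Table C-3.
    The log2(N) coefficients are an assumed reading of the table."""
    log2N = math.log2(N)
    return {
        "in_cell": 48 * N,
        "cell_to_cell": (63 + 6 * fK) * N,
        "additions": (208 + 8 * fK + 2 * log2N) * N,
        "multiplications": (93 + 3 * fK + 4 * log2N) * N,
    }

counts = operation_counts(N=64, fK=50)
# Express every count relative to the in-cell transfer count.
ratios = {name: value / counts["in_cell"] for name, value in counts.items()}
for name in counts:
    print(f"{name:16s} {counts[name]:8.0f}   ratio to in-cell = {ratios[name]:.2f}")
```

With these inputs the ratios come out at roughly 7.6 for cell-to-cell
transfers, 12.9 for additions, and 5.6 for multiplications, broadly in
line with the rounded figures quoted above; the multiplication count is
about three-quarters of the cell-to-cell transfer count.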

C-8. PERFORMANCE OF AND MEMORY REQUIRED FOR SOME SAMPLE PROBLEMS

In order to gain some insight into the performance capabilities of
this type of array computer, three sample problems will be considered:

(1)  A (logical) array of 50 x 200 cells with N = 32; 50 points
     in the direction normal to the boundary, 200 in the downstream
     direction, and 32 in the cross-stream direction. It is believed
     that this "one-third of a million point" problem is the absolute
     minimum problem of interest.

(2)  A (logical) array of 50 x 200 cells with N = 128; 50 points
     normal to the boundary, 200 in the downstream direction, and
     128 across the stream. The "million point" problem is quite
     interesting.

(3)  A (logical) array of 100 x 1000 cells with N = 128; 100 points
     normal to the boundary, 1000 points in the downstream direction,
     and 128 points across the stream. This "ten million point"
     problem is very interesting.


                                Table C-2

              OPERATION COUNT FOR EACH PHASE OF ONE TIME STEP

                                         In-Cell    Cell-to-Cell              Multipli-
Phase  Calculation                       Transfers  Transfers     Additions   cations

  1    D_ijk                             N          2N            5N          3N
  2    E_ijk                             2N         2N            14N         5N
  3    S_ijk & Q_ijk                     15N        23N           50N         30N
  4    Fourier-Transform Q_ijk & p_ijk   2N         0             ½N log₂N    N log₂N
       over k, i.e., z direction
  5    Pressure-Boundary Conditions      5N         7N            27N         14N
  6    Fourier-Transform Pressure-       4N         0             N log₂N     2N log₂N
       Boundary Conditions
  7    Relax p_ijα                       0          2fKN          6fKN        3fKN
  8    Apply Boundary Conditions         0          4fKN          2fKN        0
       on p_ijα
  9    Inverse Fourier-Transform p_ijα   2N         0             ½N log₂N    N log₂N
 10    F^n (3 Components)                11N        16N           94N         38N
 11    u^n (3 Components)                3N         0             12N         3N
 12    Apply Boundary Conditions         3N         13N           6N          0
       on u^n

f = Fraction of Fourier modes containing energy (f = ?).

K = Number of iterations required for the relaxation solution for
    one Fourier mode of the pressure field to converge (K = 50 ?).

N = Number of mesh points in the transverse flow direction (z
    direction); also number of Fourier modes in z.


                                Table C-3

                 TOTAL OPERATION COUNT FOR ONE TIME STEP

    IN-CELL TRANSFERS      = 48N
    CELL-TO-CELL TRANSFERS = (63 + 6fK)N
    ADDITIONS              = (208 + 8fK + 2 log₂N)N
    MULTIPLICATIONS        = (93 + 3fK + 4 log₂N)N

    Number of words of memory = 18 + 12N
    Number of registers       = 30

f = Fraction of Fourier modes containing energy (f = ?).

K = Number of iterations required for the relaxation solution for
    one Fourier mode of the pressure field to converge (K = 50 ?).

N = Number of mesh points in the transverse flow direction (z
    direction); also number of Fourier modes in z.


In order to calculate computation time per time step it is necessary
to make assumptions about the in-cell transfer, the cell-to-cell
transfer, the addition, and the multiplication times as well as to assume
values for f and K. It will be assumed that [...]

[...] computer. This table is reproduced from [...] et al.(19) In this
technique, the spatial differencing is the same as that used in the
analysis of the array computer's performance, i.e., second-order central
differences; the time differencing is the leap-frog [...]; a fast Poisson
solver is used. The downstream direction has L points, the cross-stream
direction has N points, and the direction normal to the boundary has M
points. It will be assumed that, for the conventional computer (e.g., a
CDC 7600),

    Addition time        = 10^-7 sec     = 100 nsec
    Multiplication time  = 2 x 10^-7 sec = 200 nsec
    Memory transfer time = 10^-6 sec     = 1 μsec

The estimates for each of the three problems are given in Table
C-5. These estimates are broken down in several ways (all times are
in seconds). Column one lists the problems. The total number of bits
of memory required for each problem is listed in column two. In the
third column the time to advance the interior points one time step is
given, while in the fourth column is the time required to advance the
boundary points one time step. The sum of these is the total time
required to advance all the grid points one time step, and this is given
in column seven. This total time is also the sum of the time required
to compute the pressure field (given in column five) and the time
required to compute the velocity field (column six). Finally, column
eight is the total time required to calculate one time step on a
conventional computer. This time was calculated from the total operation
count given in Table C-4 and the addition, multiplication, and memory
transfer times for the conventional computer, as tabulated above.

From an examination of Table C-5, it is clear that most of the
computation time of the array computer is spent computing the pressure
field for the interior points of the grid; the ratio of the calculation
time for the interior to the calculation time for the boundary is about
2, while the ratio of the calculation time for the pressure field to
that for the velocity field is about 7. This holds true for all three
problems. In short, the dominant part of the calculation is the
relaxation solution of the pressure field on the interior of the grid.

Although it is not shown in this table, precisely the opposite holds
true for the calculation on the fast conventional computer using a fast
Poisson solver, wherein the ratio of the velocity-field calculation time
to the pressure-field calculation time is about 3.

For problems (1) and (2), the ratio of total calculation time on the
conventional computer to that of the array computer is about 230.
Increasing the number of grid points in the cross-stream direction from
32 to 128 changes this ratio by less than 1 percent. However, increasing
the number of cells from $10^4$ to $10^5$ increases the ratio to about
2370. As was expected, the ten-fold increase in the number of cells is
fully reflected in the speed ratio.

The  problem  with  32  grid  points  in  the  cross-stream,  z,  direction 
requires  432  words  of  memory,  about  14K  bits  per  cell.  Increasing  the 
number  of  grid  points  in  the  z direction  to  128  increases  the  size  of 
the  required  cell  memory  to  1584  words  or  about  50K  bits.  Problem  (1) 
fits  nicely  into  a 16K  bit  memory  and  problems  (2)  and  (3)  are  good  fits 
to  a 64K  bit  memory. 
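The sizes quoted for the three sample problems can be checked with a few
lines of arithmetic. The sketch rests on two inferences that are not
stated in this section: the per-cell word counts of 432 and 1584 are
reproduced when the 30 registers are counted together with the 18 + 12N
memory words of Table C-3, and the bit counts are consistent with a
32-bit word. The names below are illustrative.

```python
WORD_BITS = 32  # assumption: word length inferred from the quoted bit counts

def words_per_cell(N):
    """Memory words (18 + 12N, Table C-3) plus the 30 registers (assumed included)."""
    return 18 + 12 * N + 30

# (points normal to boundary, points downstream, points cross-stream N)
problems = {1: (50, 200, 32), 2: (50, 200, 128), 3: (100, 1000, 128)}

for label, (normal, downstream, N) in problems.items():
    cells = normal * downstream          # one cell per (i, j) position
    points = cells * N                   # each cell holds a rod of N points
    words = words_per_cell(N)
    print(f"problem ({label}): {cells} cells, {points} grid points, "
          f"{words} words = {words * WORD_BITS} bits per cell")
```

Under these assumptions the per-cell requirement is 13,824 bits for
N = 32 and 50,688 bits for N = 128, which matches the "about 14K" and
"about 50K" figures and explains the fit to 16K-bit and 64K-bit memories.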

C-9. SUMMARY OF CONCLUSIONS

A standard algorithm, using a relaxation method to solve for the
pressure field, fits quite well into an array computer. No major
modifications of the algorithm were required. The entire calculation is
dominated by the relaxation solution on the interior grid points of the
Poisson equation for the pressure field. The number of cell-to-cell
transfers is about equal to the number of multiplications.

A calculation with $10^4$ cell processors and 32 grid points in the
cross-stream direction requires about 0.09 seconds per time step, which
is about 230 times faster than a fast conventional computer. It requires
about 14K bits of memory per cell.

If the number of grid points in the cross-stream direction is
increased to 128, the calculation time per time step increases to about
0.35 seconds, and the required memory per cell increases to about 50K
bits. The ratio of the speed of the array to that of a conventional
computer remains about 230.


If the cell array is enlarged by a factor of 10, i.e., to $10^5$ cell
processors, the calculation time and the per-cell memory requirements
for the cell computer are unchanged, but the speed ratio increases to
about 2370, fully reflecting the increase in the cellular-array size.
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